Deontological values and moral trade

[For this post, I’ll assume some familiarity with the concept of moral trade and the distinction between consequentialist and deontological values.]

In earlier work, I claimed that (in the specific context of ECL) if you are trying to benefit someone’s moral view as part of some cooperative arrangement, only the consequentialist aspects of their moral values are relevant to you. That is, if you want to act cooperatively toward Alice’s moral values, then you need to consider only the consequentialist components of Alice’s value system. For instance, you need to ask yourself: Would Alice wish for there to be fewer lies? You don’t need to ask yourself whether Alice considers it a moral imperative not to lie herself (except insofar as it relates to the former question).

While I still believe that there’s some truth to this claim, I now believe that the claim is straightforwardly incorrect. In short, it seems plausible that, for example, Alice’s moral imperative not to lie extends to actions by others that Alice brings about via trade. (Furthermore, it seems plausible that this is the case even if Alice doesn’t consider it a moral imperative to, for example, donate money to fund fact checkers in a distant country. That is, it seems plausible that this is the case even if Alice doesn’t care about lies in a fully consequentialist way.) I’ll give further intuition pumps for the relevance of deontological constraints below. I’ll take a more abstract perspective in the section right after giving the examples.

Related work. Toby Ord’s article on moral trade also has a section (titled “Consequentialism, deontology, and virtue ethics”) that discusses how the concept of moral trade interacts with the distinction between deontological and consequentialist ethics. (He also discusses virtue ethics, which I ignore to keep it simple.) However, he doesn’t go into much detail and seems to make different points than this post. For instance, he argues that even deontologists are often somewhat consequentialist (which my earlier writing also emphasizes). He also makes at least one point that is somewhat contrary to the claims in this article. I will discuss this briefly below (under example P1).

Examples. I’ll now give some examples of situations in which it seems intuitively compelling that someone’s deontological duties propagate through a trade relationship (P1–3). I’ll vary both the duties and the mode of trade. I’ll then also provide two negative examples (N1,2), i.e., examples where deontological norms arguably don’t propagate through trade. The examples are somewhat redundant. There’s no need to read them all!

The examples will generally consider the perspective of the deontologist’s trading partner, who is uncertain about the deontologist’s views (rather than the perspective of the deontologist herself). I’m taking the trading partner’s perspective because I’m interested in how to deal with others’ deontological views in trade. I’ll assume that one cannot simply ask the deontologist, because this would shift all the difficulty to the deontologist and away from her trading partner.

P1: Say you have a friend Alice. Alice follows the following policy: Whenever someone does something that’s good for her moral views, Alice pays them back in some way, e.g., monetarily. For instance, Alice is concerned about animal welfare. Whenever she learns that hardcore carnivore Bob eats a vegetarian meal, she sends him a dollar. (There are lots of practical game-theoretic difficulties with this – can vegetarians all earn free money from Alice by claiming that they are vegetarian only to be nice to her? – but let’s ignore these.)

Now, let’s say that Carol is considering what actions to take in light of Alice’s policy. Carol comes up with the following idea. Perhaps she should put up posters falsely claiming that local cows graze on heavily polluted pastures. Let’s say that she is sure that this has a positive impact on animal welfare (i.e., that she has reason not to be concerned about backlash, etc.). But now let’s say that Alice once said that she wouldn’t want to lie even when doing so has positive consequences. Meanwhile, she also isn’t a “truth maximizer”; she has argued that the cause of correcting others’ inconsequential lies out in the world isn’t worthwhile. Should Carol expect to receive payment?

To me it seems plausible that Carol shouldn’t expect payment, or at least that she should be doubtful about whether she will be paid. (For what it’s worth, it seems that Claude 3 agrees.) Given Alice’s stance on lying, Alice surely wouldn’t want to put up the posters herself. It seems plausible to me that Alice then also wouldn’t want to pay Carol to put up such posters (even if the payment is made in retrospect).

Here’s one way to think about it. Imagine that Alice has written a book, “What I’m happy to pay people to do”. The book contains a list with items such as “raise awareness of animal welfare conditions under factory farming”, “eat less meat”, etc. Would we expect the book to contain an item “put up misleading posters that cause people to avoid meat”? Again, I would imagine that the answer is no. Putting such an item in the book would directly cause others to lie on Alice’s behalf. More specifically, it would cause others to lie in pursuit of Alice’s goals. This doesn’t seem so different from Alice lying herself. Perhaps when Carol tries to benefit Alice’s moral views, she should act on what she predicts Alice would have put in the book (even if Alice never actually writes such a book).

Interestingly, Toby Ord’s article on the subject contains a hint in the opposite direction: “[I]t is possible that side constraints or agent-relative value could encourage moral trade. For example, someone might think that it is impermissible for them to lie in order to avoid some suffering but that it wouldn’t be impermissible to convince someone else to make this lie in order to avoid the suffering.”

P2: Let’s say Alice and Bob are decision-theoretically similar enough that they would cooperate with each other in a one-shot Prisoner’s Dilemma (even under somewhat asymmetric payoffs). Let’s say that Alice could benefit Bob but would have to lie in order to do so. Conversely, let’s say that Bob could benefit Alice but would have to steal in order to do so. Alice believes that one (deontologically) ought not to steal, but thinks that lying is in principle acceptable in and of itself. Conversely, Bob believes that one (deontologically) ought not to lie, but thinks that stealing is acceptable.

Ignoring the deontological constraints, Alice would benefit Bob to make it more likely that Bob benefits Alice. But now what if they take the deontological constraints into account? Would Alice still benefit Bob to make it more likely that Bob benefits Alice? Again, I think it’s plausible that she wouldn’t. By lying in order to benefit Bob, Alice makes it likely that Bob would steal. In some sense Alice makes Bob steal in service of Alice’s goals. It seems intuitive that Alice’s deontological constraints against stealing should still apply. It would be a strange “hack” if deontological constraints didn’t apply in this context; one could circumvent deontological constraints simply by trading one’s violations with others. (It’s a bit similar to the plot of “Strangers on a Train”, in which a similar swap is proposed to avoid criminal liability.) (Again, compare Toby Ord’s comment, as discussed in P1.)

P3: The previous examples illustrate the propagation of negative duties (duties of the “thou shalt not…” variety). I’ll here give an example of how positive duties might also transfer.

The country of Charlesia is attacked by its neighbor. Charlesia is a flourishing liberal democracy, so Alice considers it an ethical imperative to protect Charlesia. In order to do so, Alice hires a group of mercenaries to fight on Charlesia’s side. Since time is of the essence, the mercenaries aren’t given a proper contract that specifies their exact objectives, rules of engagement and such.

Two weeks later, the mercenaries find themselves marching through the warzone in Charlesia. They encounter a weak old man who is asking to be evacuated. Doing so would divert significant resources from the mercenaries’ military mission. The mercenaries and their equipment are very expensive. So, clearly, if Alice wanted to spend money in order to help weak old people, she wouldn’t have spent the money on hiring the mercenaries. Thus, if Alice is a pure consequentialist, she wouldn’t want the mercenaries to help the weak old man. So if the mercenaries wanted to act in mercenary fashion (i.e., if they wanted to do whatever Alice wants them to do), should they leave the weak old man to die?

Not necessarily! I think it is entirely plausible that Alice would want the mercenaries to act according to common-sense, not fully consequentialist ethics. That is, it seems plausible that Alice would want the mercenaries to observe the moral obligation to help the old man, as Alice would presumably do if she were in the mercenaries’ place. As in the other cases, it seems intuitive that the mercenaries are in some sense acting on Alice’s behalf. So it seems that Alice would want the mercenaries to act somewhat similarly to how she would act.

N1: Let’s say Alice from P1 wants to pay Bob to eat more vegetarian meals. Unfortunately, Bob’s social group believes that “real men eat meat”. Therefore, Bob says that if he becomes a vegetarian (or reducetarian), he’ll have to lie to his friends about his diet. (If this lie is not consequential enough to be ethically relevant, you can add further context. For instance, you might imagine that Bob’s lies will cause his friends to develop less accurate views about the healthiness of different diets.) Let’s say that Alice observes deontological norms of honesty. Does this mean that Bob should reason that becoming a vegetarian wouldn’t be a successful cooperative move towards Alice?

It’s unclear, but in this case it seems quite plausible that Bob can benefit Alice by becoming a closet vegetarian. It seems that the lie is more of a side effect – Bob doesn’t lie for Alice, he lies for himself (or in some sense for his social group). So it seems that (according to conventional ethical views) Alice doesn’t bear much responsibility for the lie. So in this case, it seems plausible to me that Alice’s deontological constraint against lying isn’t (strongly) propagated through the trade relationship between Alice and Bob.

N2: The inhabitants of a sparsely populated swampland want to found a state. (Founding a state can be viewed as a many-player trade.) Founding a state would require agreement from Alice – she’s the only accountant in the swampland and the envisioned state would require her to keep an eye on finances. (That said, lots of other inhabitants of the swampland are also individually necessary for the state, and only a small fraction of the state’s tax revenue would come from Alice.) Unfortunately, Alice can’t be present for the founding meeting of the state. Thus, in designing the state, the remaining inhabitants of the swampland have to make guesses about Alice’s interests.

After some discussion, over 90% of the resources of the state have been allocated, all to issues that Alice is known to be on board with: infrastructure, a medical system, social security, education, etc. The state policies also include a few measures that Alice agrees with and that only a minority of other inhabitants of the swampland care about, including a museum, a concert hall and a public library.

One of the last issues to be sorted out is law enforcement and in particular the legal system. In the past, vigilante justice has ruled the land. Now a formal set of laws is to be enforced by a police force. Everyone (including Alice) agrees that this will reduce crime, make punishments more humane, and make it less likely that innocents are unfairly punished. Most inhabitants of the swampland agree that there should be a death penalty for the most severe crimes. However, Alice holds that all killing is unethical. In the past, when she carried out vigilante justice herself, she strictly avoided killing, even when doing so meant letting someone get away with murder.

If the members of the founding meeting propose a state policy that involves the death penalty, must they expect that Alice will refuse to perform accounting for the state?

Again, I think that under many circumstances it’s reasonable for the others to expect that Alice will not block the state. It seems that if people like Alice refused to participate whenever a single policy conflicted with their norms, then founding states (or other large social arrangements) with them would simply be too difficult. Perhaps hermits would reject the state, but agents who function in societies can’t be so fussy.

Conceptualization via a model of deontology. I’ll here propose a simple model of deontological ethics to get a better grasp on the role of deontological norms in trade. I won’t immediately use this model for anything, so feel free to skip this section! 

Consider the following model of deontological values. Let’s say your actions can result in consequences via different types of “impact paths”. Consequentialists don’t care about the impact path – they just care about the consequence itself. Deontologists generally care about the consequences too, but they also care about the impact path, and depending on the type of impact path, they might care more or less about the consequence. For instance, if an outcome is brought about via inaction, then a deontologist might care less about it. Similarly, impact paths that consist of long causal chains that are hard to predict might matter less to deontologists. Meanwhile, deontologists care a lot about the “empty” impact path, i.e., about cases where the action itself is the consequence. For example, deontologists typically care much more about minimizing the lies that they themselves tell than about not acting in ways that cause others to lie. How much each type of impact path matters depends on the particular deontological view, and different deontologists might disagree. (I’m not sure how good a model of deontological values this is. It’s certainly a very consequentialist perspective. (Cf. Sinnott-Armstrong (2009) [paywalled].) It’s also quite vague, of course.)
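To make this slightly more concrete, here is a minimal sketch in code (my own formalization of the model above; the particular path types and weights are made-up assumptions, not anything from the post):

```python
# A toy formalization of the impact-path model sketched above. An action is
# evaluated by summing the values of its consequences, with each consequence
# weighted by how much this particular deontological view cares about the
# impact path through which the action brings that consequence about.

# Hypothetical weights for one particular deontologist (pure assumptions):
PATH_WEIGHTS = {
    "own action": 1.0,         # the "empty" impact path: the act itself
    "via trade": 0.8,          # outcomes brought about by trading
    "long causal chain": 0.3,  # hard-to-predict downstream effects
    "inaction": 0.1,           # outcomes merely allowed to happen
}

def evaluate(consequences):
    """consequences: list of (value, impact_path) pairs for one action."""
    return sum(value * PATH_WEIGHTS[path] for value, path in consequences)

# A pure consequentialist is the special case where every weight equals 1.
# Example: telling a lie oneself vs. bringing about a lie via trade.
print(evaluate([(-1.0, "own action")]))  # -1.0
print(evaluate([(-1.0, "via trade")]))   # -0.8
```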

In this model of deontological views, trade is simply one type of impact path by which we can bring about outcomes. And then the different types of trades like tit for tat, signing contracts, ECL, and other forms of acausal trade are specific subtypes of impact paths. Different deontologists have different views about the paths. So in particular, there can be deontological views according to which bringing something about by trade is like bringing it about by inaction. But I think it’s more natural for deontologists to care quite a bit about at least some simple impact paths via trade, as argued above.

If impact paths via trade matter according to Alice’s deontological views, then Alice’s trading partners need to take Alice’s deontological views into account (unless they simply receive explicit instructions from Alice about what Alice wants them to do).

Implications for ECL. (For this section, I assume familiarity with ECL (formerly MSR). Please skip this section if you’re unfamiliar with this.)

Do deontological views matter for how we should do ECL? In principle, they clearly do. For instance, consequentialists who otherwise don’t abide by deontological norms (even for instrumental reasons) should abide by deontological norms in their implementation of ECL.

That said, I think that in practice following trading partners’ deontological views via ECL might not matter so much, especially because ECLers might already take deontological norms into account even absent the above consideration. (Perhaps this is also a reason why neither I nor, to my knowledge, anyone else pointed out that my earlier writing on ECL was wrong to claim that the deontological views of trading partners can be ignored.) Specifically:

  • Consequentialists typically argue that consequentialism itself already implies that we should follow deontological norms in practice. For instance, consequentialists might say that in practice lies (especially consequential lies) are eventually found out with high enough probability that the risk of being found out typically outweighs the potential benefits of getting away with the lie. It is also sometimes argued that to apply consequentialism in practice, one has to follow simple rules (since assessing all the different possible consequences of an action is intractable) and that the rules proposed by deontologists are rules that consequentialists should follow in lieu of trying to calculate the consequences of all of their actions. If you agree with these sorts of views, then in particular you’d hold that consequentialist ECLers should already abide by deontological norms anyway, even if they only consider the consequentialist aspects of the ethical views they’re trying to benefit. Of course, there might be deontological norms that aren’t justifiable on consequentialist grounds. The propagation of these norms through trade would be relatively important.
  • My sense is that pure consequentialists are rare. My sense is that most EAs are, if anything, more likely than others to abide by standard ethical norms (such as not lying). (Sam Bankman-Fried is commonly brought up as an example of an overly consequentialist EA.) In any case, if you yourself already subscribe to lots of deontological norms, then importing deontological norms via ECL makes less of a difference. Again, you might of course specifically not subscribe to some specific popular deontological norms. If so, then ECL makes these norms more relevant to you.
  • Even without deontological views in the picture, ECL often pushes toward more deontological-norm-abiding behavior. For instance, ECL suggests a less adversarial posture towards people with other value systems.

Some research questions. I should start by saying that I assume there’s already some literature in ethics on the relation between deontological views and trade. In traditional trade contexts (e.g., paying the baker so that she bakes us bread), this seems like a pertinent issue. For instance, I assume deontologists have considered ethical consumerism. I haven’t tried to review this literature. I would imagine that it mostly addresses the normative ethical dimension rather than, say, the game-theoretic aspects. There’s also a question of to what extent the moral trade context is fundamentally different from the traditional economic trade context.

An immediate obstacle to researching the role of deontological views in moral trade is that we don’t have toy models of deontological ethics. For trading on consequentialist grounds, we have some good economic toy models (agents investing in different types of interventions with comparative advantages, see, for example, the ECL paper). Are there corresponding toy models to get a better grasp on trading when deontological constraints are involved? Perhaps the simplest formal model would be that deontologists only agree to a trade if agreeing to the deal doesn’t increase the total number of times that their norms are violated, but that’s arguably too strong.
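As a starting point, here is what the “simplest formal model” just mentioned might look like in code (my own sketch; as noted, the rule is arguably too strong):

```python
# A toy version of the acceptance rule: a deontologist agrees to a deal only
# if it does not increase the total number of violations of her norms, and
# otherwise evaluates the deal by its consequentialist value to her.

def accepts_deal(violations_with, violations_without, value_with, value_without):
    """Compare the world with the deal to the world without it, from the
    deontologist's perspective."""
    if violations_with > violations_without:
        return False  # the deal increases violations of her norms: refuse
    return value_with > value_without  # otherwise accept iff it helps her values

# Example (placeholder numbers): a deal that benefits her values but adds
# one violation of her norms is refused under this rule.
print(accepts_deal(violations_with=1, violations_without=0,
                   value_with=10.0, value_without=0.0))  # False
```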

In real-world trades, one important question is to what extent and how a trading partner’s deontological constraints propagate beyond actions taken specifically to benefit that trading partner. The positive examples above (P1–3) consider cases where agent 1 faces a single choice solely to benefit agent 2 and then ask whether agent 2’s deontological views restrict that choice. Meanwhile, the negative examples above (N1,2) involve more complicated interactions between multiple decision problems and between the interests of multiple actors. So, in general what happens if agent 1 also faces other decisions in which she pursues exclusively her own goals – do the deontological views apply to those actions as well in some way (as in N1)? What should happen in multi-party trade (as in N2)? Let’s say agent 1 makes choices in order to benefit both agent 2 and agent 3, and that only agent 2 has deontological ethics. Then do agent 2’s deontological views fully apply to agent 1’s actions? Or is it perhaps sufficient for agent 2’s participation in the trade to not increase the number of violations by agent 1 of agent 2’s duties? (If so, then what is the right counterfactual for what trade(s) happen?) Of course, these are in part (descriptive) ethical questions. (“How do deontologists want their deontological views to propagate through trade?”) I wonder whether a technical analysis can nonetheless provide some insights. 

Acknowledgments. I thank Emery Cooper and Joseph Carlsmith for helpful comments and discussions.

On pragmatist critiques of self-locating beliefs

I assume some familiarity with basic concepts in the area of anthropics (a.k.a. self-locating beliefs or imperfect recall) and decision theory.

A pragmatist maxim of action relevance

Consider the following form of pragmatism, which I think is close to Peircian pragmatism:

The pragmatist maxim of action relevance: You should only ask yourself questions that are (in principle) decision-relevant. That is, you should only ask yourself a question Q if there is some (hypothetical) decision situation where you take different actions depending on how you answer Q.

I am aware that this is quite vague and has potential loopholes. (For any question Q, does it count as a decision situation if someone just asks you what you think about Q? We have to rule this out somehow if we don’t want the maxim to be vacuous.) For the purpose of this post, a fairly vague notion of the maxim will suffice.

The action relevance maxim rules out plenty of definitional questions. E.g., it recommends against most debates on the question, “If a tree falls in a forest and no one is around to hear it, does it really make a sound?”. (It does allow asking the question, “What meaning should we give to the word ‘sound’?”) Importantly, it allows asking normative ethical questions such as, “Should we kill murderers or put them in jail?”. By allowing hypothetical scenarios, you can still ask questions about time travel and so forth. Perhaps it has some controversial implications for when and why one should discuss consciousness. (If you already fully understand how, say, a biological system works, then how will it matter for your actions whether that system is conscious? I can only think of ethical implications – if a system is conscious, I don’t want to harm it. Therefore, consciousness becomes primarily a question of what systems we should care about. This seems similar to some eliminativist views, e.g., that of Brian Tomasik. That said, for every definitional question (what is “sound”, “consciousness”, etc.), there are pragmatically acceptable questions, such as, “What is a rigorous/detailed definition of ‘consciousness’ that agrees with (i.e., correctly predicts) our intuitions about whether a system is conscious?”.) I think the implications for anthropics / self-locating beliefs are also controversial. Anyway, I find the above pragmatist maxim compelling.

It is worth noting that there are also other pragmatist principles under which none of the below applies. For example, Eliezer Yudkowsky has a post titled “Making Beliefs Pay Rent in Anticipated Experiences”. Self-locating beliefs anticipate experiences. So even without action relevance they “pay rent” in this sense.

Successful pragmatist critiques of anthropics

I think some pragmatist critiques of anthropics are valid. Here’s the most important critique that I think is valid: (It uses some terms that I’ll define below.)

If you have non-indexical preferences and you think updateless decision theory – i.e., maximizing ex ante expected utility (expected utility from the perspective of the prior probability distribution) – is the only relevant normative decision criterion, then the philosophical question of what probabilities you should assign in scenarios of imperfect recall disappears.
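For concreteness, here is one way to write the ex ante criterion (my own notation, not from the post; $P_0$ is the prior over scenarios and $\pi$ ranges over policies mapping observations to actions):

$$\pi^{*} \in \operatorname*{arg\,max}_{\pi} \; \mathbb{E}_{s \sim P_0}\!\big[ U(\mathrm{outcome}(\pi, s)) \big].$$

The point of the critique above is that choosing $\pi^{*}$ never requires assigning probabilities to self-locating propositions from “within” the scenario.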

First, some explanations and caveats: (I’ll revisit the first two later in this post.)

  • By non-indexical preferences I mean preferences that don’t depend on where in the scenario you are. For example, in Sleeping Beauty, an indexical preference might be: “I prefer to watch a movie tonight.” This is indexical because the meaning of “tonight” differs between the Monday and Tuesday instantiations of Beauty. The Tuesday instantiation will want Beauty to watch a movie on Tuesday night, while the Monday instantiation will not care whether Beauty watches a movie on Tuesday night.
  • I’ll leave it to another post to explain what I consider to be a “philosophical” question. (Very roughly, I mean: questions for which there’s no agreed-upon methodology for evaluating proposed answers and arguments.) I’ll give some questions below about self-locating beliefs that I consider to be non-philosophical, such as how to compute optimal policies.
  • Of course, scenarios of imperfect recall may also involve other (philosophical) issues that aren’t addressed by UDT (the maxim of following the ex ante optimal policy) and that don’t arise in, say, Sleeping Beauty or the absent-minded driver. For example, we still have to choose a prior, deal with problems of game theory (such as equilibrium selection), deal with infinities (e.g., as per infinite ethics) and so on. I’m not claiming that UDT addresses any of these problems.

As an example, optimizing ex ante utility is sufficient to decide whether to accept or reject any given bet in Sleeping Beauty. Also, one doesn’t need to answer the question, “what is the probability that it is Monday and the coin came up Heads?” (On other questions it is a bit unclear whether the ex ante perspective commits to an answer or not. In some sense, UDTers are halfers: in variants of Sleeping Beauty with bets, the UDTer’s expected utility calculations will have ½ in place of probabilities, and the calculation overall looks very similar to that of someone who uses EDT and (double) halfing (a.k.a. minimum-reference-class SSA). On the other hand, in Sleeping Beauty without bets, UDTers don’t give any answer to questions about what the probabilities are. On the third hand, probabilities are closely tied to decision making anyway. So even (“normal”, non-UDT) halfers might say that when they talk about probabilities in Sleeping Beauty, all they’re talking about is what numbers they’re going to multiply utilities with when offered (hypothetical) bets. Anyway: Pragmatism! There is no point in debating whether UDTers are halfers or not.)
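For example (a hypothetical bet of my own choosing), suppose that at each awakening Beauty may accept a bet that pays $2 if the coin came up Tails and loses $3 if it came up Heads, and that she must follow the same policy at every awakening. The ex ante calculation for the policy “accept” is

$$\tfrac{1}{2}\cdot(-3) \;+\; \tfrac{1}{2}\cdot(2+2) \;=\; 0.5 > 0,$$

so the ex ante optimal policy accepts. The halfer-style $\tfrac{1}{2}$s appear as weights on the two coin outcomes, with the Tails term counted twice because the bet is then accepted at both awakenings; this is the sense in which the calculation resembles EDT with (double) halfing.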

Some other, less central critiques succeed as well. In general, it’s common to imagine purely definitional disputes arising about any philosophical topic. So, if you show me a (hypothetical) paper titled, “Are SIA probabilities really probabilities?”, I will be a little skeptical of the paper’s value.

(There are also lots of other possible critiques of various pieces of work on anthropics that are outside the scope of this post. Arguably too many papers rely too much on intuition pumps. For instance, Bostrom’s PhD thesis/book on anthropics is sometimes criticized for this (anonymous et al., personal communication, n.d.). I also think that anthropic arguments applied to the real world (the Doomsday argument, the simulation argument, arguments from fine tuning, etc.) often don’t specify that they use specific theories of self-locating beliefs.)

The defense

I now want to defend some of the work on anthropics against pragmatist critiques. The above successful critique already highlights, to some extent, three caveats. Each of these gives rise to a reason why someone might think about how to reason de se (from “within” the scenario, “updatefully”) about games of imperfect recall:

  1. Indexical preferences. Ex ante optimization (UDT) alone doesn’t tell you what to do if you have indexical preferences, because it’s not clear how to aggregate preferences between the different “observer moments”. Armstrong (2011) shows a correspondence between methods of assigning self-locating beliefs (SIA, etc.) and methods of aggregating preferences across copies (average and total utilitarianism). That’s a great insight! But it doesn’t tell you what to do. (Perhaps it’s an argument for relativism/antirealism: you can choose whatever way of aggregating preferences across observer moments you like, and so you could also choose whatever method of assigning self-locating beliefs that you like. But even if you buy into this relativist/antirealist position, you still need to decide what to do.)
  2. Rejecting updatelessness (e.g., claiming it’s irrational to pay in counterfactual mugging). If the ex ante optimal/updateless choice is not the unambiguously correct one, then you have to ask yourself what other methods of decision making you find more compelling.
  3. Asking non-philosophical questions about which procedures work. One might want to know which kinds of reasoning “work” for various notions of “work” (satisfying the reflection principle; when used for decision making: avoiding synchronic or diachronic Dutch books, being compatible with the ex ante optimal/updateless policy). Why?
    • I’m sure some philosophers do this just out of curiosity. (“Non-minimum reference class SSA seems appealing. I wonder what happens if we use it to make decisions.” (It mostly doesn’t work.))
    • But there are also lots of very practical reasons that apply even if we fully buy into updatelessness. In practice, even those who buy fully into updatelessness talk about updating their probabilities on evidence in relatively normal ways, thus implicitly assigning the kind of self-locating beliefs that UDT avoids. (For example, they might say, “I read a study that caused me to increase my credence in vitamin D supplements being beneficial even in the summer. Therefore, I’ve ordered some vitamin D tablets.” Not: “From my prior’s perspective, the policy of ordering vitamin D tablets upon reading such-and-such studies has higher expected utility than the policy of not taking vitamin D tablets when exposed to such studies.”) Updating one’s beliefs normally seems more practical. But if in the end we care about ex ante utility (UDT), then does updating even make sense? What kind of probabilities are useful in pursuit of the ex ante optimal/updateless policy? And how should we use such probabilities?

      As work in this area shows: both SIA probabilities (thirding) and minimum-reference class SSA probabilities (double halfing) can be useful, while non-minimum-reference class SSA probabilities (single halfing) probably aren’t. Of the two that work, I think SIA actually more closely matches intuitive updating. (In sufficiently large universes, every experience occurs at least once. Minimum-reference class SSA therefore makes practically no updates between large universes.) But then SIA probabilities need to be used with CDT! We need to be careful not to use SIA probabilities with EDT.

      Relatedly, some people (incl. me) think about this because they wonder how to build artificial agents that choose correctly in such problems. Finding the ex ante optimal policy directly is generally computationally difficult. Finding CDT+SIA policies is likely theoretically easier than finding an ex ante optimal policy (CLS-complete as opposed to NP-hard), and also can be done using practicable modern ML techniques (gradient descent). Of course, in pursuit of the ex ante optimal policy we need not restrict ourselves to methods that correspond to methods involving self-locating beliefs. There are some reasons to believe that these methods are computationally natural, however. For example, CDT+SIA is roughly computationally equivalent to finding local minima of the ex ante expected utility function.
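As a tiny illustration of that last point (my own toy example, using the standard absent-minded driver payoffs: 0 for exiting at the first intersection, 4 for exiting at the second, 1 for continuing past both), gradient ascent on the ex ante expected utility finds the ex ante optimal policy, which by the correspondence just mentioned is also a CDT+SIA fixed point:

```python
# Gradient ascent on the ex ante expected utility of the absent-minded
# driver. The policy is a single number p = P(continue at an intersection).
# Ex ante utility: EU(p) = 4 * p * (1 - p) + 1 * p**2 = 4p - 3p^2.

def grad_eu(p):
    return 4 - 6 * p  # derivative of 4p - 3p^2

p = 0.1  # arbitrary initial policy
for _ in range(2000):
    p += 0.01 * grad_eu(p)
    p = min(max(p, 0.0), 1.0)  # keep p a valid probability

print(round(p, 4))  # ~0.6667: the ex ante optimal policy p = 2/3
```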

Acknowledgments

I thank Emery Cooper, Vince Conitzer and Vojta Kovarik for helpful comments.

An information-theoretic variant of the adversarial offer to address randomization

[I assume the reader is familiar with Newcomb’s problem and causal decision theory. Some familiarity with basic theoretical computer science ideas also helps.]

The Adversarial Offer

In a recently (Open-Access-)published paper, I (together with my PhD advisor, Vince Conitzer) proposed the following Newcomb-like scenario as an argument against causal decision theory:

> Adversarial Offer: Two boxes, B1 and B2, are on offer. A (risk-neutral) buyer may purchase one or none of the boxes but not both. Each of the two boxes costs $1. Yesterday, the seller put $3 in each box that she predicted the buyer would not acquire. Both the seller and the buyer believe the seller’s prediction to be accurate with probability 0.75.

If the buyer buys one of the boxes, then the seller makes an expected profit of $1 – 0.25 * $3 = $0.25. Nonetheless, causal decision theory recommends buying a box. This is because at least one of the two boxes must contain $3, so that the average box contains at least $1.50. It follows that the causal decision theorist must assign an expected causal utility of at least $1.50 to (at least) one of the boxes. Since $1.50 exceeds the cost of $1, causal decision theory recommends buying one of the boxes. This seems undesirable. So we should reject causal decision theory.
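To spell the argument out in symbols (my restatement of the reasoning just given), write $c_i$ for the amount of money in box $B_i$. In every possible state $c_1 + c_2 \ge 3$, so

$$\mathbb{E}[c_1] + \mathbb{E}[c_2] = \mathbb{E}[c_1 + c_2] \ge 3 \quad\Longrightarrow\quad \max_i \mathbb{E}[c_i] \ge \$1.50 > \$1,$$

which is why CDT must regard at least one of the boxes as worth its $1 price.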

The randomization response

One of the obvious responses to the Adversarial Offer is that the agent might randomize. In the paper, we discuss this topic at length in Section IV.1 and in the subsection on ratificationism in Section IV.4. If you haven’t thought much about randomization in Newcomb-like problems before, it probably makes sense to first check out the paper and only then continue reading here, since the paper makes more straightforward points.

The information-theoretic variant

I now give a new variant of the Adversarial Offer, which deals with the randomization objection in a novel and very interesting way. Specifically, the unique feature of this variant is that CDT correctly assesses randomizing to be a bad idea. Unfortunately, it is quite a bit more complicated than the Adversarial Offer from the paper.

Imagine that the buyer is some computer program that has access to a true random number generator (TRNG). Imagine also that the buyer’s source code plus all its data (memories) has a size of, say, 1GB and that the seller knows that it has (at most) this size. If the buyer wants to buy a box, then she will have to pay $1 as usual, but instead of submitting a single choice between buying box 1 and buying box 2, she has to submit 1TB worth of choices. That is, she has to submit a sequence of 2^43 (=8796093022208) bits, each encoding a choice between the boxes.

If the buyer buys and thus submits some such string of bits w, the seller will do the following. First, the seller determines whether there exists any deterministic 1GB program that outputs w. (This is undecidable. We can fix this if the seller knows the buyer’s computational constraints. For example, if the seller knows that the buyer can do less than 1 exaFLOP worth of computation, then the seller could instead determine only whether there is a 1GB program that produces w with at most 1 exaFLOP worth of computation. This is decidable (albeit not very tractable).) If there is no such program, then the seller knows that the buyer randomized. The buyer then receives no box while the seller keeps the $1 payment. The buyer is told in advance that this is what the seller will do. Note that for this step, the seller doesn’t use her knowledge of the buyer’s source code other than its length (and its computational constraints).

If there is at least one 1GB program that outputs w deterministically, then the seller forgets w again. She then picks an index i of w at random. She predicts w[i] based on what she knows about the buyer and based on w[1…i-1], i.e., on all the bits of w preceding i. Call the prediction w’[i]. She fills the box based on her prediction w’[i] and the buyer receives (in return for the $1) the box specified by w[i].
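To summarize the seller’s procedure, here is a schematic sketch (my own illustration; the two hard steps – the check for a short deterministic program and the prediction of the next bit – are stand-in functions here, the first being intractable or undecidable in reality):

```python
import random

def seller_procedure(w, looks_deterministic, predict_bit, prize=3.0, price=1.0):
    """w: the buyer's submitted sequence of box choices (0 or 1 per entry).
    looks_deterministic: stand-in for 'some <=1GB program (within the buyer's
    compute budget) outputs w'. predict_bit: stand-in for the seller's
    prediction of w[i] from the prefix w[:i] and her model of the buyer."""
    if not looks_deterministic(w):
        # Buyer is judged to have randomized: no box, seller keeps the $1.
        return {"buyer": -price, "seller": price}
    i = random.randrange(len(w))    # pick a random index i
    predicted = predict_bit(w[:i])  # the seller's prediction w'[i]
    chosen = w[i]                   # the buyer receives box w[i]
    # The received box contains $3 only if the seller predicted the other box.
    payout = prize if predicted != chosen else 0.0
    return {"buyer": payout - price, "seller": price - payout}

# Toy usage with dummy stand-ins (for illustration only):
w = [0, 1] * 8
print(seller_procedure(w, lambda s: True,
                       lambda prefix: prefix[-1] if prefix else 0))
```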

Why the information-theoretic variant is interesting

The scenario is interesting because of the following three facts (which I will later argue to hold):

  1. The seller makes a profit off agents who try to buy boxes, regardless of whether they do so using randomization or not.
  2. CDT and related theories (such as ratificationism) assess randomizing to be worse than not buying any box.
  3. CDT will recommend buying a box (mostly deterministically).

I’m not aware of any other scenarios with these properties. Specifically, the novelty is item 2. (Our paper offers a scenario that has the other two properties.) The complications of this scenario – letting the agent submit a TB worth of choices to then determine whether they are random – are all introduced to achieve item 2 (while preserving the other items).

In the following, I want to argue in more detail for these three points and for the claim that the scenario can be set up at all (which I will do under claim 1).

1. For this part, we need to show two things:

A) Intuitively, if the agent submits a string of bits that uses substantially more than 1GB worth of randomness, then he is extremely likely to receive no box at all.

B) Intuitively, if the agent uses only about 1GB or less worth of randomness, then the seller – using the buyer’s source code – will likely be able to predict w[i] with high accuracy based on w[1…i-1].

I don’t want to argue too rigorously for either of these, but below I’ll give intuitions and some sketches of the information-theoretic arguments that one would need to give to make them more rigorous.

A) The very simple point here is that if you create, say, a 2GB bitstring w where each bit is determined by a fair coin flip, then it is very unlikely that there exists a program that deterministically outputs w. After all, there are many more ways to fill 2GB of bits than there are 1GB programs (about 2^(2^33) times as many). From this one may be tempted to conclude that if the agent determines, say, 2GB of the TB of choices by flipping coins, he is likely to receive no box. But this argument is incomplete, because there are other ways to use coin flips. For example, the buyer might use the following policy: Flip 2GB worth of coins. If they all come up heads, always take the same box. Otherwise follow some given deterministic procedure.
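To put rough numbers on the counting point (my own back-of-the-envelope version): there are $2^{2^{34}}$ bit strings of length 2GB, but at most about $2^{2^{33}+1}$ programs of size at most 1GB, so the fraction of 2GB strings that are the output of any such program is at most about

$$\frac{2^{2^{33}+1}}{2^{2^{34}}} = 2^{\,2^{33}+1-2^{34}} = 2^{-(2^{33}-1)},$$

which is astronomically small.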

To make the argument rigorous, I think we need to state the claim information-theoretically. But even this is a bit tricky. For example, it is not a problem per se for w to have high entropy if most of the entropy comes from a small part of the distribution. (For example, if with probability 0.01 the buyer randomizes all choices and with the remaining probability always chooses box 1, then the entropy of w is roughly 0.01 * 1TB = 10GB, but the buyer is still likely to receive a box.) So I think we’d need to make a more complicated claim, of the sort: if there is no substantial part (say, >p) of the distribution over w that has less than, say, 2GB of entropy, then with high probability (>1-p), the agent will receive no box.

B) Again, we can make a simple but incomplete argument: If of the 1TB of choices, only, say, 2GB are determined by random coin flips, then a randomly sampled bit is likely to be predictable from the agent’s source code. But again, the problem is that the random coin flips can be used in other ways. For example, the buyer might use a deterministic procedure to determine w (say, w=01010101…), but then randomly generate a number n (with any number n chosen with probability 2^-n, for instance), then randomly sample n indices j of w and flip w[j] for each of them. This may have relatively low entropy. But now the seller cannot perfectly predict w[i] given w[1…i-1] for any i.

Again, I think a rigorous argument requires information theory. In particular, we can use the fact that H(w) = H(w[0])+H(w[1]|w[0])+H(w[2]|w[0,1])+…, where H denotes entropy. If H(w) is less than, say, 2GB, then the average of H(w[i]|w[1…i-1]) must be at most 2GB/1TB ≈ 1/500 bits. From this it follows that for most i, w[i] can be predicted with high accuracy given w[1…i-1].
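Spelled out (my sketch; entropies in bits, with $n = 2^{43}$ choices and $H(w) \le 2^{34}$ bits, i.e., 2GB):

$$\sum_{i=1}^{n} H\big(w[i] \mid w[1\ldots i-1]\big) = H(w) \le 2^{34} \quad\Longrightarrow\quad \frac{1}{n}\sum_{i=1}^{n} H\big(w[i] \mid w[1\ldots i-1]\big) \le 2^{-9} \approx \frac{1}{500}.$$

By Markov’s inequality, only a small fraction of indices can have conditional entropy much larger than $2^{-9}$, and (e.g., by Fano’s inequality) a bit with very low conditional entropy can be predicted from its prefix with accuracy close to 1.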

2. This is essential, but straightforward: Generating w at random causes the seller to conclude that w was generated at random. Therefore, CDT (accurately) assesses randomly generating w to have an expected utility near $-1.

3. Finally, I want to argue that CDT will recommend buying a box. For this, we only need to argue that CDT prefers some method of submitting w over not buying any box. So consider the following procedure: First, assign beliefs over the seller’s prediction w’[0] of the first bit. Since there are only two possible boxes, for at least one of the boxes j, it is the case that P(w’[0]=j)<=½, where P refers to the probability assigned by the buyer. Let w[0] = j. We now repeat this inductively. That is, for each i, given the w[1…i-1] that we have already constructed, the buyer sets w[i]=k s.t. P(w’[i]=k|w[1…i-1])<=½.
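In code, this greedy construction looks roughly as follows (my own sketch; `p_seller_predicts_box0` is a hypothetical model of the buyer’s beliefs about the seller’s prediction given the prefix, not anything from the scenario itself):

```python
# Construct w deterministically, bit by bit: at each index, pick whichever
# box the buyer believes the seller predicts with probability at most 1/2.

def construct_w(n_bits, p_seller_predicts_box0):
    w = []
    for _ in range(n_bits):
        p0 = p_seller_predicts_box0(tuple(w))  # P(w'[i] = box 0 | prefix)
        # If box 0 is predicted with probability <= 1/2, choose box 0;
        # otherwise box 1 is predicted with probability < 1/2, so choose it.
        w.append(0 if p0 <= 0.5 else 1)
    return w

# Toy usage with a dummy belief model (assumption for illustration only):
print(construct_w(16, lambda prefix: 0.5))
```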

What’s the causal expected utility of submitting w thus constructed? Well, for one, because the procedure is deterministic (if ties are broken deterministically), the buyer can be confident that she will actually receive a box. Moreover, for every i, the buyer thinks that if i is the index sampled by the seller for the purpose of determining which box to give to the buyer, then the buyer will in causal expectation receive at least $1.50, because the seller will have predicted the wrong box, i.e. w’[i] ≠ w[i], with probability at least ½.

The lack of performance metrics for CDT versus EDT, etc.

(This post assumes that the reader is fairly familiar with the decision theory of Newcomb-like problems. Schwarz makes many of the same points in his post “On Functional Decision Theory” (though I disagree with him on other things, such as whether to one-box or two-box in Newcomb’s problem). (ETA: Also see Bales (2018) (thanks to Sylvester Kollin for pointing out this reference).) Similar points have also been made many times about the concept of updatelessness in particular, e.g., see Section 7.3.3 of Arif Ahmed’s book “Evidence, Decision and Causality”, my post on updatelessness from a long time ago, or Sylvester Kollin’s “Understanding updatelessness in the context of EDT and CDT”. Preston Greene, on the other hand, argues explicitly for a view opposite of the one in this post in his paper “Success-First Decision Theories”.)

I sometimes read the claim that one decision theory “outperforms” some other decision theory (in general or in a particular problem). For example, Yudkowsky and Soares (2017) write: “FDT agents attain high utility in a host of decision problems that have historically proven challenging to CDT and EDT: FDT outperforms CDT in Newcomb’s problem; EDT in the smoking lesion problem; and both in Parfit’s hitchhiker problem.” Others use some variations of this framing (“dominance”, “winning”, etc.), some of which I find less dubious because they have less formal connotations.

Based on typical usage, these words make it seem as though there were some agreed-upon or objective metric by which to compare decision theories in any particular problem and as though MIRI were claiming to have found a theory that is better according to that metric (in some given problems). This would be similar to how one might say that one machine learning algorithm outperforms another on the CIFAR dataset, where everyone agrees that ML algorithms are better if they correctly classify a higher percentage of the images, require less computation time, need fewer samples during training, etc.

However, there is no agreed-upon metric to compare decision theories, no way to assess – even for a particular problem – whether one decision theory (or its recommendation) does better than another. (This is why the CDT-versus-EDT-versus-other debate is at least partly a philosophical one.) In fact, it seems plausible that finding such a metric is “decision theory-complete” (to butcher another term with a specific meaning in computer science). By that I mean that settling on a metric is probably just as hard as settling on a decision theory and that mapping between plausible metrics and plausible decision theories is fairly easy.

For illustration, consider Newcomb’s problem and a few different metrics. One possible metric is what one might call the causal metric, which is the expected payoff if we were to replace the agent’s action with action X by some intervention from the outside. Then, for example, in Newcomb’s problem, two-boxing “performs” better than one-boxing and CDT “outperforms” FDT. I expect that many causal decision theorists would view something of this ilk as the right metric and that CDT’s recommendations are optimal according to the causal metric in a broad class of decision problems.

A second possible metric is the evidential one: given that I observe that the agent uses decision theory X (or takes action Y) in some given situation, how big a payoff do I expect the agent to receive? This metric directly favors EDT in Newcomb’s problem, the smoking lesion, and again a broad class of decision problems.
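To illustrate how these first two metrics come apart, here is a toy calculation (my own illustration, with assumed numbers: $1,000,000 in the opaque box if one-boxing was predicted, $1,000 in the transparent box, predictor accuracy 0.99):

```python
M, K, ACC = 1_000_000, 1_000, 0.99

def evidential_metric(action):
    # Expected payoff given that we *observe* the agent take this action.
    p_million = ACC if action == "one-box" else 1 - ACC
    return p_million * M + (K if action == "two-box" else 0)

def causal_metric(action, p_predicted_one_box):
    # Expected payoff if the action is set by outside intervention, holding
    # the (already made) prediction fixed at whatever prior we assign to it.
    return p_predicted_one_box * M + (K if action == "two-box" else 0)

print(evidential_metric("one-box"), evidential_metric("two-box"))
# ~990000 vs ~11000: the evidential metric favors one-boxing.
print(causal_metric("one-box", 0.5), causal_metric("two-box", 0.5))
# 500000.0 vs 501000.0: the causal metric favors two-boxing for any fixed prior.
```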

A third possibility is a modification of the causal metric. Rather than replacing the agent’s decision, we replace its entire decision algorithm before the predictor looks at and creates a model of the agent. Despite being causal, this modification favors decision theories that recommend one-boxing in Newcomb’s problem. In general, the theory that seems to maximize this metric is some kind of updateless CDT (cf. Fisher’s disposition-based decision theory). 

Yet another causalist metric involves replacing from the outside the decisions of not only the agent itself but also of all agents that use the same decision procedure. Perhaps this leads to Timeless Decision Theory or Wolfgang Spohn’s proposal for causalist one-boxing.

One could also use the notion of regret (as discussed in the literature on multi-armed bandit problems) as a performance measure, which probably leads to ratificationism.

Lastly, I want to bring up what might be the most commonly used class of metrics: intuitions of individual people. Of course, since intuitions vary between different people, intuition provides no agreed upon metric. It does, however, provide a non-vacuous (albeit in itself weak) justification for decision theories. Whereas it seems unhelpful to defend CDT on the basis that it outperforms other decision theories according to the causal metric but is outperformed by EDT according to the evidential metric, it is interesting to consider which of, say, EDT’s and CDT’s recommendations seem intuitively correct.

Given that finding the right metric for decision theory is similar to the problem of decision theory itself, it seems odd to use words like “outperforms” which suggest the existence or assumption of a metric.

I’ll end with a few disclaimers and clarifications. First, I don’t want to discourage looking into metrics and desiderata for decision theories. I think it’s unlikely that this approach to discussing decision theory can resolve disagreements between the different camps, but that’s true for all approaches to discussing decision theory that I know of. (An interesting formal desideratum that doesn’t trivially relate to decision theories is discussed in my blog post Decision Theory and the Irrelevance of Impossible Outcomes. At its core, it’s not really about “performance measures”, though.)

Second, I don’t claim that the main conceptual point of this post is new to, say, Nate Soares or Eliezer Yudkowsky. In fact, they have written similar things; see, for instance, Ch. 13 of Yudkowsky’s Timeless Decision Theory, in which he argues that decision theories are untestable because counterfactuals are untestable. Even in the aforementioned paper, claims about outperforming are occasionally qualified. E.g., Yudkowsky and Soares (2017, sect. 10) say that they “do not yet know […] (on a formal level) what optimality consists in”. Unfortunately, most outperformance claims remain unqualified. The metric is never specified formally or discussed much. The short verbal descriptions that are given make it hard to understand how their metric differs from the metrics corresponding to updateless CDT or updateless EDT.

So, my complaint is not so much about these authors’ views but about a motte-and-bailey-type inconsistency, in which the takeaways from reading the paper superficially are much stronger than the takeaways from reading the whole paper in depth and paying attention to all the details and qualifications. I’m worried that the paper gives many casual readers the wrong impression. For example, gullible non-experts might get the impression that decision theory is like ML in that it is about finding algorithms that perform as well as possible according to some agreed-upon benchmarks. Uncharitable but sophisticated skim-readers may view MIRI’s positions as naive or confused about the nature of decision theory.

In my view, the lack of an agreed-upon performance measure is an important fact about the nature of decision theory research. Nonetheless, I think that, e.g., MIRI is doing and has done very valuable work on decision theory. More generally I suspect that being wrong or imprecise about this issue (that is, about the lack of performance metrics in the decision theory of Newcomb-like problems) is probably not an obstacle to having good object-level ideas. (Similarly, while I’m not a moral realist, I think being a moral realist is not necessarily an obstacle to saying interesting things about morality.)

Acknowledgement

This post is largely inspired by conversations with Johannes Treutlein. I also thank Emery Cooper for helpful comments.

The Stag Hunt against a similar opponent

[I assume that the reader is familiar with Newcomb’s problem and the Prisoner’s Dilemma against a similar opponent and ideally the equilibrium selection problem in game theory.]

The trust dilemma (a.k.a. Stag Hunt) is a game with a payoff matrix kind of like the following:

|   | S | H |
|---|---|---|
| S | 4, 4 | -1, 3 |
| H | 3, -1 | 2, 2 |

Its defining characteristic is the following: The Pareto-dominant outcome (i.e., the outcome that is best for both players), (S,S), is a Nash equilibrium. However, (H,H) is also a Nash equilibrium. Moreover, if you’re sufficiently unsure what your opponent is going to do, then H is the best response. If two agents learn to play this game and they start out playing the game at random, then they are more likely to converge to (H,H). Overall, we would like it if the two agents played (S,S), but I don’t think we can assume this to happen by default.
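A quick calculation with the payoff matrix above (my own check of the “sufficiently unsure” claim; payoffs are the row player’s):

```python
payoff = {("S", "S"): 4, ("S", "H"): -1, ("H", "S"): 3, ("H", "H"): 2}

def expected_payoff(my_action, p_opponent_plays_S):
    p = p_opponent_plays_S
    return p * payoff[(my_action, "S")] + (1 - p) * payoff[(my_action, "H")]

for p in (0.5, 0.7, 0.75, 0.8, 0.95):
    s, h = expected_payoff("S", p), expected_payoff("H", p)
    print(p, "S" if s > h else ("H" if h > s else "tie"))
# S only becomes the best response once P(opponent plays S) exceeds 3/4.
```

With these payoffs, you need to assign at least 75% probability to the opponent playing S before S becomes the best response, which is one way of making the “sufficiently unsure” claim precise.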

Now what if you played the trust dilemma against a similar opponent (specifically one that is similar w.r.t. how they play games like the trust dilemma)? Clearly, if you play against an exact copy, then by the reasoning behind cooperating in the Prisoner’s Dilemma against a copy, you should play S. More generally, it seems that a similarity between you and your opponent should push towards trusting that if you play S, the opponent will also play S. The more similar you and your opponent are, the more you might reason that the decision is mostly between (S,S) and (H,H) and the less relevant are (S,H) and (H,S).

What if you played against an opponent who knows you very well and who has time to predict how you will choose in the trust dilemma? Clearly, if you play against an opponent who can perfectly predict you (e.g., because you are an AI system and they have a copy of your source code), then by the reasoning behind one-boxing in Newcomb’s problem, you should play S. More generally, the more you trust your opponent’s ability to predict what you do, the more you should trust that if you play S, the opponent will also play S.

Here’s what I find intriguing about these scenarios. In these scenarios, one-boxers might systematically arrive at a different (more favorable) conclusion than two-boxers. However, this conclusion is still compatible with two-boxing, or with blindly applying Nash equilibrium. In the trust dilemma, one-boxing type reasoning merely affects how we resolve the equilibrium selection problem, which the orthodox theories generally leave open. This is in contrast to the traditional examples (Prisoner’s Dilemma, Newcomb’s problem) in which the two ways of reasoning are in conflict. So there is room for implications of one-boxing, even in an exclusively Nash equilibrium-based picture of strategic interactions.

Moral realism and AI alignment

Abstract: Some have claimed that moral realism – roughly, the claim that moral claims can be true or false – would, if true, have implications for AI alignment research, such that moral realists might approach AI alignment differently than moral anti-realists. In this post, I briefly discuss different versions of moral realism based on what they imply about AI. I then go on to argue that pursuing moral-realism-inspired AI alignment would bypass philosophical disagreements and help resolve non-philosophical disagreements related to moral realism. Hence, even from a non-realist perspective, it is desirable that moral realists (and others who understand the relevant realist perspectives well enough) pursue moral-realism-inspired AI alignment research.

Different forms of moral realism and their implications for AI alignment

Roughly, moral realism is the view that “moral claims do purport to report facts and are true if they get the facts right.” So for instance, most moral realists would hold the statement “one shouldn’t torture babies” to be true. Importantly, this moral claim is different from a claim about baby torturing being instrumentally bad given some other goal (a.k.a. a “hypothetical imperative”) such as “if one doesn’t want to land in jail, one shouldn’t torture babies.” It is uncontroversial that such claims can be true or false. Moral claims, as I understand them in this post, are also different from descriptive claims about some people’s moral views, such as “most Croatians are against babies being tortured” or “I am against babies being tortured and will act accordingly”. More generally, the versions of moral realism discussed here claim that moral truth is in some sense mind-independent. It’s not so obvious what it means for a moral claim to be true or false, so there are many different versions of moral realism. I won’t go into more detail here, though we will revisit differences between different versions of moral realism later. For a general introduction to moral realism and meta-ethics, see, e.g., the SEP article on moral realism.

I should note right here that I myself find at least “strong versions” of moral realism implausible. But in this post, I don’t want to argue about meta-ethics. Instead, I would like to discuss an implication of some versions of moral realism. I will later say more about why I am interested in the implications of a view I believe to be misguided, but for now suffice it to say that “moral realism” is a majority view among professional philosophers (though I don’t know how popular the versions of moral realism studied in this post are), which makes it interesting to explore the view’s possible implications.

The implication that I am interested in here is that moral realism helps with AI alignment in some way. One very strong version of the idea is that the orthogonality thesis is false: if there is a moral truth, agents (e.g., AIs) that are able to reason successfully about a lot of non-moral things will automatically be able to reason correctly about morality as well and will then do what they infer to be morally correct. On p. 176 of “The Most Good You Can Do”, Peter Singer defends such a view: “If there is any validity in the argument presented in chapter 8, that beings with highly developed capacities for reasoning are better able to take an impartial ethical stance, then there is some reason to believe that, even without any special effort on our part, superintelligent beings, whether biological or mechanical, will do the most good they possibly can.” In the articles “My Childhood Death Spiral”, “A Prodigy of Refutation” and “The Sheer Folly of Callow Youth” (among others), Eliezer Yudkowsky says that he used to hold such a view.

Of course, current AI techniques do not seem to automatically include moral reasoning. For instance, if you develop an automated theorem prover to reason about mathematics, it will not be able to derive “moral theorems”. Similarly, if you use the Sarsa algorithm to train some agent with some given reward function, that agent will adapt its behavior in a way that increases its cumulative reward regardless of whether doing so conflicts with some ethical imperative. The moral realist would thus have to argue that in order to get to AGI or superintelligence or some other milestone, we will necessarily have to develop new and very different reasoning algorithms and that these algorithms will necessarily incorporate ethical reasoning. Peter Singer doesn’t state this explicitly. However, he makes a similar argument about human evolution on p. 86f. in ch. 8:

The possibility that our capacity to reason can play a critical role in a decision to live ethically offers a solution to the perplexing problem that effective altruism would otherwise pose for evolutionary theory. There is no difficulty in explaining why evolution would select for a capacity to reason: that capacity enables us to solve a variety of problems, for example, to find food or suitable partners for reproduction or other forms of cooperative activity, to avoid predators, and to outwit our enemies. If our capacity to reason also enables us to see that the good of others is, from a more universal perspective, as important as our own good, then we have an explanation for why effective altruists act in accordance with such principles. Like our ability to do higher mathematics, this use of reason to recognize fundamental moral truths would be a by-product of another trait or ability that was selected for because it enhanced our reproductive fitness—something that in evolutionary theory is known as a spandrel.
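To make the earlier point about the value-neutrality of current AI techniques concrete, here is the textbook Sarsa update in a minimal sketch (my own illustration, not code from any of the cited works). The only signal shaping the learned behavior is the numerical reward r, so nothing in the update could distinguish an “ethical” from an “unethical” reward function.

```python
from collections import defaultdict

# Minimal sketch of the Sarsa update. Q maps (state, action) pairs to value estimates;
# alpha is the learning rate, gamma the discount factor. Whatever reward function is
# plugged in, the update simply moves Q toward higher cumulative reward.
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])

Q = defaultdict(float)  # value estimates start at zero
```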

A slightly weaker variant of this strong convergence moral realism is the following: Not all superintelligent beings would be able to identify or follow moral truths. However, if we add some feature that is not itself directly normative, then superintelligent beings with that feature would automatically identify the moral truth. For example, David Pearce appears to claim that “the pain-pleasure axis discloses the world’s inbuilt metric of (dis)value” and that therefore any superintelligent being that can feel pain and pleasure will automatically become a utilitarian. At the same time, a proponent of this view could believe that a non-conscious AI would not necessarily become a utilitarian. So, this slightly weaker variant of strong convergence moral realism would be consistent with the orthogonality thesis.

I find all of these strong convergence moral realisms very implausible. Especially given how current techniques in AI work – how value-neutral they are – the claim that algorithms for AGI will all automatically incorporate the same moral sense seems extraordinary and I have seen little evidence for it1 (though I should note that I have read only bits and pieces of the moral realism literature).2

It even seems easy to come up with semi-rigorous arguments against strong convergence moral realism. Roughly, it seems that we can use a moral AI to build an immoral AI. Here is a simple example of such an argument. Imagine we had an AI system that (given its computational constraints) always chooses the most moral action. Now, it seems that we could construct an immoral AI system using the following algorithm: Use the moral AI to decide which action of the immoral AI system it would prevent from being taken if it could only choose one action to be prevented. Then take that action. There is a gap in this argument: perhaps the moral AI simply refuses to choose the moral actions in “prevention” decision problems, reasoning that it might currently be used to power an immoral AI. (If exploiting a moral AI was the only way to build other AIs, then this might be the rational thing to do as there might be more exploitation attempts than real prevention scenarios.) Still (without having thought about it too much), it seems likely to me that a more elaborate version of such an argument could succeed.
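To make the construction explicit, here is a minimal sketch (my own formalization; the interface most_important_to_prevent is hypothetical) of how a moral AI could be used to power an immoral one:

```python
# Minimal sketch of the argument above. We assume (hypothetically) that the moral AI can
# answer the query: "if you could prevent exactly one of these candidate actions, which
# one would you prevent?" The immoral wrapper then performs exactly that action.
def immoral_policy(moral_ai, candidate_actions, situation):
    worst_action = moral_ai.most_important_to_prevent(candidate_actions, situation)
    return worst_action
```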

Here’s a weaker moral realist convergence claim about AI alignment: There’s moral truth and we can program AIs to care about the moral truth. Perhaps it suffices to merely “tell them” to refer to the moral truth when deciding what to do. Or perhaps we would have to equip them with a dedicated “sense” for identifying moral truths. This version of moral realism again does not claim that the orthogonality thesis is wrong, i.e. that sufficiently effective AI systems will automatically behave ethically without us giving them any kind of moral guidance. It merely states that in addition to the straightforward approach of programming an AI to adopt some value system (such as utilitarianism), we could also program the AI to hold the correct moral system. Since pointing at something that exists in the world is often easier than describing that thing, it might be thought that this alternative approach to value loading is easier than the more direct one.

I haven’t found anyone who defends this view (I haven’t looked much), but non-realist Brian Tomasik gives this version of moral realism as a reason to discuss moral realism:

Moral realism is a fun philosophical topic that inevitably generates heated debates. But does it matter for practical purposes? […] One case where moral realism seems problematic is regarding superintelligence. Sometimes it’s argued that advanced artificial intelligence, in light of its superior cognitive faculties, will have a better understanding of moral truth than we do. As a result, if it’s programmed to care about moral truth, the future will go well. If one rejects the idea of moral truth, this quixotic assumption is nonsense and could lead to dangerous outcomes if taken for granted.

(Below, I will argue that there might be no reason to be afraid of moral realists. However, my argument will, like Brian’s, also imply that moral realism is worth debating in the context of AI.)

As an example, consider a moral realist view according to which moral truth is similar to mathematical truth: there are some axioms of morality which are true (for reasons I, as a non-realist, do not understand or agree with) and together these axioms imply some moral theory X. This moral realist view suggests an approach to AI alignment: program the AI to abide by these axioms (in the same way as we can have automated theorem provers assume some set of mathematical axioms to be true). It seems clear that something along these lines could work. However, this approach also relies on moral realism only weakly: an anti-realist could program an AI to follow the same axioms without regarding them as true.

As a second example, divine command theory states that moral truth is determined by God’s will (again, I don’t see why this should be true and how it could possibly be justified). A divine command theorist might therefore want to program the AI to do whatever God wants it to do.

Here are some more such theories:

  • Social contract
  • Habermas’ discourse ethics
  • Universalizability / Kant’s categorical imperative
  • Applying human intuition

Besides pointing being easier than describing, another potential advantage of such a moral realist approach might be that one is more confident in one’s meta-ethical view (“the pointer”) than in one’s object-level moral system (“one’s own description”). For example, someone could be confident that moral truth is determined by God’s will but be unsure whether God’s will is expressed via the Bible, the Quran, or something else, or how these religious texts are to be understood. Then that person would probably favor an AI that cares about God’s will over an AI that follows some particular interpretation of, say, the moral rules proposed in the Quran and Sharia.

A somewhat related issue which has received more attention in the moral realism literature is the convergence of human moral views. People have given moral realism as an explanation for why there is near-universal agreement on some ethical views (such as “when religion and tradition do not require otherwise, one shouldn’t torture babies”). Similarly, moral realism has been associated with moral progress in human societies (see, e.g., Huemer 2016). At the same time, people have used the existence of persisting and unresolvable moral disagreements (see, e.g., Bennigson 1996 and Sayre-McCord 2017, sect. 1) and the existence of gravely immoral behavior in some intelligent people (see, e.g., Nichols 2002) as arguments against moral realism. Of course, all of these arguments take moral realism to include a convergence thesis on which being a human (and perhaps not being affected by some mental disorders) or being a society of humans is sufficient to grasp and abide by moral truth.

Of course, there are also versions of moral realism that have even weaker (or just very different) implications for AI alignment and do not make any relevant convergence claims (cf. McGrath 2010). For instance, there may be moral realists who believe that there is a moral truth but that machines are in principle incapable of finding out what it is. Some may also call very different views “moral realism”, e.g. claims that given some moral imperative, it can be decided whether an action does or does not comply with that imperative. (We might call this “hypothetical imperative realism”.) Or “linguistic” versions of moral realism which merely make claims about the meaning of moral statements as intended by whoever utters these moral statements. (Cf. Lukas Gloor’s post on how different versions of moral realism differ drastically in terms of how consequential they are.) Or a kind of “subjectivist realism”, which drops mind-independence (cf. Olson 2014, ch. 2).

Why moral-realism-inspired research on AI alignment might be useful

I can think of many reasons why moral realism-based approaches to AI safety have not been pursued much: AI researchers often do not have a sufficiently high awareness of or interest in philosophical ideas; the AI safety researchers who do – such as researchers at MIRI – tend to reject moral realism, at least the versions with implications for AI alignment; although “moral realism” is popular among philosophers, versions of moral realism with strong implications for AI (à la Peter Singer or David Pearce) might be unpopular even among philosophers (cf. again Lukas’ post on how different versions of moral realism differ drastically in terms of how consequential they are); and so on…

But why am I now proposing to conduct such research, given that I am not a moral realist myself? The main reason (besides some weaker reasons like pluralism and keeping this blog interesting) is that I believe AI alignment research from a moral realist perspective might actually increase agreement between moral realists and anti-realists about how (and to what extent) AI alignment research should be done. In the following, I will briefly argue this case for the strong (à la Peter Singer and David Pearce) and the weak convergence versions of moral realism outlined above.

Strong versions

Like most problems in philosophy, the question of whether moral realism is true lacks an accepted truth condition or an accepted way of verifying an answer or an argument for either realism or anti-realism. This is what makes these problems so puzzling and intractable. This is in contrast to problems in mathematics, where it is pretty clear what counts as a proof of a hypothesis. (This is, of course, not to say that mathematics involves no creativity or that there are no general-purpose “tools” for philosophy.) However, the claim made by strong convergence moral realism is more like a mathematical claim. Although it is yet to be made precise, we can easily imagine a mathematical (or computer-scientific) hypothesis stating something like this: “For any goal X of some kind [namely the objectively incorrect and non-trivial-to-achieve kind] there is no efficient algorithm that when implemented in a robot achieves X in some class of environments. So, for instance, it is in principle impossible to build a robot that turns Earth into a pile of paperclips.” It may still be hard to formalize such a claim, and mathematical claims can still be hard to prove or disprove. But determining the truth of a mathematical statement is no longer a philosophical problem. If someone lays out a mathematical proof or disproof of such a claim, any reasonable person’s opinion would be swayed. Hence, I believe that work on proving or disproving this strong version of moral realism will lead to (more) agreement on whether the “strong-moral-realism-based theory of AI alignment” is true.
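Purely as an illustration of the shape such a hypothesis might take once formalized (my own sketch; every symbol here is hypothetical and not taken from any existing formalization), one could write something like:

$$\forall u \in \mathcal{U}_{\text{wrong}} \;:\; \neg\,\exists\, \pi \in \Pi_{\text{efficient}} \text{ such that } \mathbb{E}_{e \sim \mathcal{E}}\big[u(\pi, e)\big] \ge \theta,$$

where $\mathcal{U}_{\text{wrong}}$ stands for the class of objectively incorrect, non-trivial-to-achieve goals, $\Pi_{\text{efficient}}$ for the class of efficiently computable policies, $\mathcal{E}$ for the class of environments in question, and $\theta$ for some threshold above which the goal counts as achieved.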

It is worth noting that finding out whether strong convergence is true may not resolve metaphysical issues. Of course, all strong versions of moral realism would turn out false if the strong convergence hypothesis were falsified. But other versions of moral realism would survive. Conversely, if the strong convergence hypothesis turned out to be true, then anti-realists may remain anti-realists (cf. footnote 2). But if our goal is to make AI moral, the convergence question is much more important than the metaphysical question. (That said, for some people the metaphysical question has a bearing on whether they have preferences over AI systems’ motivation system – “if no moral view is more true than any other, why should I care about what AI systems do?”)

Weak versions

Weak convergence versions of moral realism do not make such in-principle-testable predictions. Their only claim is the metaphysical view that the goals identified by some method X (such as derivation from a set of moral axioms, finding out what God wants, discourse, etc.) have some relation to moral truths. Thinking about weak convergence moral realism from the more technical AI alignment perspective is therefore unlikely to resolve disagreements about whether some versions of weak convergence moral realism are true. However, I believe that, precisely because they make no testable predictions, weak convergence versions of moral realism are also unlikely to lead to disagreement about how to achieve AI alignment.

Imagine moral realists were to propose that AI systems should reason about morality according to some method X on the basis that the result of applying X is the moral truth. Then moral anti-realists could agree with the proposal on the basis that they (mostly) agree with the results of applying method X. Indeed, for any moral theory with realist ambitions, ridding that theory of these ambitions yields a new theory which an anti-realist could defend. As an example, consider Habermas’ discourse ethics and Yudkowsky’s Coherent Extrapolated Volition. The two approaches to justifying moral views seem quite similar – roughly: do what everyone would agree with if they were exposed to more arguments. But Habermas’ theory explicitly claims to be realist while Yudkowsky is a moral anti-realist, as far as I can tell.

In principle, it could be that moral realists defend some moral view on the grounds that it is true even if it seems implausible to others. But here’s a general argument for why this is unlikely to happen. You cannot directly perceive ought statements (David Pearce and others would probably disagree), and it is easy to show that you cannot derive a statement containing an ought without using other statements containing an ought or inference rules that can be used to introduce statements containing an ought. Thus, if moral realism (as I understand it for the purpose of this post) is true, there must be some moral axioms or inference rules that are true without needing further justification, similar to how some people view the axioms of Peano arithmetic or Euclidean geometry. An example of such a moral rule could be (a formal version of) “pain is bad”. But if these rules are “true without needing further justification”, then they are probably appealing to anti-realists as well. Of course, anti-realists wouldn’t see them as deserving the label of “truth” (or “falsehood”), but assuming that realists and anti-realists have similar moral intuitions, anything that a realist would call “true without needing further justification” should also be appealing to a moral anti-realist.

As I have argued elsewhere, it’s unlikely we will ever come up with (formal) axioms (or methods, etc.) for morality that would be widely accepted by the people of today (or even by today’s Westerners with secular ethics). But I still think it’s worth a try. If it doesn’t work out, weak convergence moral realists might come around to other approaches to AI alignment, e.g. ones based on extrapolating from human intuition.

Other realist positions

Besides realism about morality, there are many other, less commonly discussed realist positions: for instance, realism about which prior probability distribution to use, about whether to choose according to some expected value maximization principle (and, if so, which one), and so on. The above considerations apply to these other realist positions as well.

Acknowledgment

I wrote this post while working for the Foundational Research Institute, which is now the Center on Long-Term Risk.


1. There are some “universal instrumental goal” approaches to justifying morality. Some are based on cooperation and work roughly like this: “Whatever your intrinsic goals are, it is often better to be nice to others so that they reciprocate. That’s what morality is.” I think such theories fail for two reasons: First, there seem to be many widely accepted moral imperatives that cannot be fully justified by cooperation. For example, we usually consider it wrong for dictators to secretly torture and kill people, even if doing so has no negative consequences for them. Second, being nice to others because one hopes that they reciprocate is not, I think, what morality is about. On the contrary, I think morality is about caring about things (such as other people’s welfare) intrinsically. I discuss this issue in detail with a focus on so-called “superrational cooperation” in chapter 6.7 of “Multiverse-wide Cooperation via Correlated Decision Making”. Another “universal instrumental goal” approach is the following: If there is at least one god, then not making these gods angry at you may be another universal instrumental goal, so whatever an agent’s intrinsic goal is, it will also act according to what the gods want. The same “this is not what morality is about” argument seems to apply.

2. Yudkowsky has written about why he now rejects this form of moral realism in the first couple of blog posts in the “Value Theory” series.

Goertzel’s GOLEM implements evidential decision theory applied to policy choice

I’ve written about the question of which decision theories describe the behavior of approaches to AI like the “Law of Effect”. In this post, I would like to discuss GOLEM, an architecture for a self-modifying artificial intelligence agent described by Ben Goertzel (2010; 2012). Goertzel calls it a “meta-architecture” because all of the intelligent work of the system is done by sub-programs that the architecture assumes as given, such as a program synthesis module (cf. Kaiser 2007).

Roughly, the top-level self-modification is done as follows. For any proposal for a (partial) self-modification, i.e. a new program to replace (part of) the current one, the “Predictor” module predicts how well that program would achieve the goal of the system. Another part of the system — the “Searcher” — then tries to find programs that the Predictor deems superior to the current program. So, at the top level, GOLEM chooses programs according to some form of expected value calculated by the Predictor. The first interesting decision-theoretical statement about GOLEM is therefore that it chooses policies — or, more precisely, programs — rather than individual actions. Thus, it would probably give the money in at least some versions of counterfactual mugging. This is not too surprising, because it is unclear on what basis one should choose individual actions when the effectiveness of an action depends on the agent’s decisions in other situations.
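Here is a minimal sketch of that top-level loop as I understand Goertzel’s description (the names searcher.propose and predictor.predicted_value are my own hypothetical stand-ins for the respective modules, not Goertzel’s API):

```python
# Sketch of GOLEM's top-level self-modification loop: the Searcher proposes candidate
# (partial) replacement programs, the Predictor scores how well each would achieve the
# system's goal, and the current program is swapped out whenever a candidate scores higher.
def golem_top_level(current_program, predictor, searcher, goal, steps=1000):
    for _ in range(steps):
        candidate = searcher.propose(current_program, goal)
        if predictor.predicted_value(candidate, goal) > predictor.predicted_value(current_program, goal):
            current_program = candidate  # adopt the program with the higher predicted value
    return current_program
```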

The next natural question to ask is, of course, what expected value (causal, evidential or other) the Predictor computes. Like the other aspects of GOLEM, the Predictor is subject to modification. Hence, we need to ask according to what criteria it is updated. The criterion is provided by the Tester, a “hard-wired program that estimates the quality of a candidate Predictor” based on “how well a Predictor would have performed in the past” (Goertzel 2010, p. 4). I take this to mean that the Predictor is judged based on the extent to which it is able to predict the things that actually happened in the past. For instance, imagine that at some time in the past the GOLEM agent self-modified to a program that one-boxes in Newcomb’s problem. Later, the agent actually faced a Newcomb problem whose prediction had been made before the agent self-modified into a one-boxer, and the agent won a million dollars. Then the Predictor should be able to predict that self-modifying to one-boxing in this case “yielded” a million dollars even though it did not do so causally. More generally, to maximize the score from the Tester, the Predictor has to compute regular (evidential) conditional probabilities and expected utilities. Hence, it seems that the expected value computed by the Predictor is a regular EDT-ish one. This is not too surprising, either, because, as we have seen before, it is much more common for learning algorithms to implement EDT, especially if they implement something which looks like the Law of Effect.
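Here is a minimal sketch (my own illustration, with a hypothetical data format) of why the Tester’s criterion pushes the Predictor toward evidential expectations: a Predictor scored on how well it retrodicts past payoffs does best by reporting the average payoff observed conditional on a given program being in place, regardless of whether that program caused the payoff.

```python
# The Tester scores a candidate Predictor by how well it retrodicts past outcomes.
def tester_score(predictor, history):
    # history: list of (program, observed_payoff) pairs from past episodes (hypothetical format)
    return -sum((predictor(program) - payoff) ** 2 for program, payoff in history)

# The Predictor maximizing this score reports E[payoff | program] as estimated from the
# historical record -- an evidential conditional expectation rather than a causal one.
def best_predictor(history):
    def predictor(program):
        payoffs = [payoff for prog, payoff in history if prog == program]
        return sum(payoffs) / len(payoffs) if payoffs else 0.0
    return predictor
```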

In conclusion, GOLEM learns to choose policy programs based on their EDT-expected value.

Acknowledgements

This post is based on a discussion with Linda Linsefors, Joar Skalse, and James Bell. I wrote this post while working for the Foundational Research Institute, which is now the Center on Long-Term Risk.

Market efficiency and charity cost-effectiveness

In an efficient market, one can expect that most goods are sold at a price-quality ratio that is hard to improve upon. If there were some easy way to produce a product more cheaply or to produce a higher-quality version of it for a similar price, someone else would probably have seized that opportunity already – after all, there are many people who are interested in making money. Competing with and outperforming existing companies thus requires luck, genius or expertise. Also, if you trust other buyers to be reasonable, you can more or less blindly buy any “best-selling” product.

Several people, including effective altruists, have remarked that this is not true in the case of charities. Since most donors don’t systematically choose the most cost-effective charities, most donations go to charities that are much less cost-effective than the best ones. Thus, if you sit on a pile of resources – your career, say – outperforming the average charity at doing good is fairly easy.

The fact that charities don’t compete for cost-effectiveness doesn’t mean there’s no competition at all. Just like businesses in the private sector compete for customers, charities compete for donors. It just happens to be the case that being good at convincing people to donate doesn’t correlate strongly with cost-effectiveness.

Note that in the private sector, too, there can be a misalignment between persuading customers and producing the kind of product you are interested in, or even the kind of product that customers in general will enjoy or benefit from using. Any example will be at least somewhat controversial, as it will suggest that buyers make suboptimal choices. Nevertheless, I think addictive drugs like cigarettes are an example that many people can agree with. Cigarettes seem to provide almost no benefits to consumers, at least relative to taking nicotine directly. Nevertheless, people buy them, perhaps because smoking is associated with being cool or because they are addictive.

One difference between competition in the for-profit and nonprofit sectors is that the latter lacks monetary incentives. It’s nearly impossible to become rich by founding or working at a charity. Thus, people primarily interested in money won’t start a charity, even if they have developed a method of persuading people of some idea that is much more effective than existing methods. However, making a charity succeed is still rewarded with status and (the belief in) having had an impact. So in terms of persuading people to donate, the charity “market” is probably somewhat efficient in areas that confer status and that potential founders and employees intrinsically care about.

If you care about investing your resource pile most efficiently, this efficiency at persuading donors offers little consolation. On the contrary, it even predicts that if you use your resources to found or support an especially cost-effective charity, fundraising will be difficult. Perhaps you previously thought that, since your charity is “better”, it will also receive more donations than existing ineffective charities. But now it seems that if cost-effectiveness really helped with fundraising, more charities would have already become more cost-effective.

There are, however, cause areas in which the argument about effectiveness at persuasion carries a different tone. In these cause areas, being good at fundraising strongly correlates with being good at what the charity is supposed to do. An obvious example is that of charities whose goal it is to fundraise for other charities, such as Raising for Effective Giving. (Disclosure: I work for REG’s sister organization FRI and am a board member of REG’s parent organization EAF.) If an organization is good at fundraising for itself, it’s probably also good at fundraising for others. So if there are already lots of organizations whose goal it is to fundraise for other organizations, one might expect that these organizations already do this job so well that they are hard to outperform in terms of money moved per resources spent. (Again, some of these may be better because they fundraise for charities that generate more value according to your moral view.)

Advocacy is another cause area in which successfully persuading donors correlates with doing a very good job overall. If an organization can persuade people to donate and volunteer to promote veganism, it seems plausible that they are also good at promoting veganism. Perhaps most of the organization’s budget even comes from people they persuaded to become vegan, in which case their ability to find donors and volunteers is a fairly direct measure of their ability to persuade people to adopt a vegan diet. (Note that I am, of course, not saying that competition ensures that organizations persuade people of the most useful ideas.) As with fundraising organizations, this suggests that it’s hard to outperform advocacy groups in areas where lots of people have incentives to advocate, because if there were some simple method of persuading people, it’s very likely that some large organization based on that method would have already been established.

That said, there are many caveats to this argument for a strong correlation between fundraising and advocacy effectiveness. First off, for many organizations, fundraising appears to be primarily about finding, retaining and escalating (i.e., increasing the donations of) a small number of wealthy donors. For some organizations, a similar statement might be true about finding volunteers and employees. In contrast, the goal of most advocacy organizations is to persuade a large number of people.1 So there may be organizations whose members are very persuasive in person and thus capable of bringing in many large donors, but who don’t have any idea about how to run a large-scale campaign oriented toward “the masses”. When trying to identify cost-effective advocacy charities, this problem can, perhaps, be addressed by giving some weight to the number of donations that a charity brings in, as opposed to donation sizes alone.2 However, the more important point is that if growing big is about big donors, then a given charity’s incentives and selection pressures for survival and growth are misaligned with persuading many people. Thus, it becomes more plausible again that the average big or fast-growing advocacy-based charity is a suboptimal use of your resource pile.

Second, I stipulated that a good way of getting new donors and volunteers is simply to persuade as many people as possible of your general message, and then hope that some of these will also volunteer at or donate to your organization. But even if all donors contribute similar amounts, some target audiences are more likely to donate than others.3 In particular, people seem more likely to contribute larger amounts if they have been involved for longer, have already donated or volunteered, and/or hold a stronger or more radical version of your organization’s views. But persuading these community members to donate works in very different ways than persuading new people. For example, being visible to the community becomes more important. Also, if donating is about identity and self-expression, it becomes more important to advocate in ways that express the community’s shared identity rather than in ways that are persuasive but compromising. The target audiences for fundraising and advocacy may also vary a lot along other dimensions: for example, to win an election, a political party has to persuade undecided voters, who tend to be uninformed and not particularly interested in politics (see p. 312 of Achen and Bartels’s Democracy for Realists); but to collect donations, one has to mobilize long-term party members who probably read lots of news, etc.

Third, the fastest-growing advocacy organizations may have large negative externalities.4 Absent regulations and special taxes, the production of the cheapest products will often damage some public good, e.g., through carbon emissions or the corruption of public institutions. Similarly, advocacy charities may damage some public good. The fastest way to find new members may involve being overly controversial, dumbing down the message or being associated with existing powerful interests, which may damage the reputation of a movement. For example, the neoliberals often suffer from being associated with special/business interests and crony capitalism (see sections “Creating a natural constituency” and “Cooption” in Kerry Vaughan’s What the EA community can learn from the rise of the neoliberals), perhaps because associating with business interests often carries short-term benefits for an individual actor. Again, this suggests that the fastest-growing advocacy charity may be much worse overall than the optimal one.

Acknowledgements

I thank Jonas Vollmer, Persis Eskander and Johannes Treutlein for comments. This work was funded by the Foundational Research Institute (now the Center on Long-Term Risk).


1. Lobbying organizations, which try to persuade individual legislators, provide a useful contrast. Especially in countries with common law, organizations may also attempt to win individual legal cases.

2. One thing to keep in mind is that investing effort into persuading big donors is probably a good strategy for many organizations. Thus, a small-donor charity that grows less quickly than a big-donor charity may be more or less cost-effective than the big-donor charity.

3. One of the reasons why one might think that drawing in new people is most effective is that people who are already in the community and willing to donate to an advocacy org probably just fund the charity that persuaded them in the first place. Of course, many people may simply not follow the sentiment of donating to the charity that persuaded them. However, many community members may have been persuaded in ways that don’t present such a default option. For example, many people were persuaded to go vegan by reading Animal Liberation. Since the book’s author, Peter Singer, has no room for more funding, these people have to find other animal advocacy organizations to donate to.

4. Thanks to Persis Eskander for bringing up this point in response to an early version of this post.

The law of effect, randomization and Newcomb’s problem

[ETA (January 2022): My co-authors James Bell, Linda Linsefors and Joar Skalse and I give a much more detailed analysis of the dynamics discussed in this post in our paper titled “Reinforcement Learning in Newcomblike Environments”, published at NeurIPS 2021.]

The law of effect (LoE), as introduced on p. 244 of Thorndike’s (1911) Animal Intelligence, states:

Of several responses made to the same situation, those which are accompanied or closely followed by satisfaction to the animal will, other things being equal, be more firmly connected with the situation, so that, when it recurs, they will be more likely to recur; those which are accompanied or closely followed by discomfort to the animal will, other things being equal, have their connections with that situation weakened, so that, when it recurs, they will be less likely to occur. The greater the satisfaction or discomfort, the greater the strengthening or weakening of the bond.

As I (and others) have pointed out elsewhere, an agent applying LoE would come to “one-box” (i.e., behave like evidential decision theory (EDT)) in Newcomb-like problems in which the payoff is eventually observed. For example, if you face Newcomb’s problem itself multiple times, then one-boxing will be associated with winning a million dollars and two-boxing with winning only a thousand dollars. (As noted in the linked note, this assumes that the different instances of Newcomb’s problem are independent. For instance, one-boxing in the first does not influence the prediction in the second. It is also assumed that CDT cannot precommit to one-boxing, e.g. because precommitment is impossible in general or because the predictions have been made long ago and thus cannot be causally influenced anymore.)
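As a minimal illustration (my own simulation sketch, not taken from the linked note), consider a Thorndike-style learner that strengthens whichever response was followed by the larger payoff, facing independent rounds of Newcomb’s problem with an accurate predictor:

```python
import random

def payoff(action):
    # The predictor is accurate, so box B contains $1,000,000 exactly when the agent one-boxes.
    return 1_000_000 if action == "one-box" else 1_000

strength = {"one-box": 1.0, "two-box": 1.0}  # Thorndike's "connections" between situation and response
for _ in range(10_000):
    action = random.choices(list(strength), weights=list(strength.values()))[0]
    # Law of effect: strengthen the chosen response in proportion to the satisfaction that followed.
    strength[action] += payoff(action) / 1_000_000

print(max(strength, key=strength.get))  # prints "one-box"
```

Because one-boxing rounds are followed by a much larger payoff, the connection to one-boxing is strengthened far more quickly, and the learner ends up one-boxing almost always.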

A caveat to this result is that with randomization one can derive more causal decision theory-like behavior from alternative versions of LoE. Imagine an agent that chooses probability distributions over actions, such as the distribution P with P(one-box)=0.8 and P(two-box)=0.2. The agent’s physical action is then sampled from that probability distribution. Furthermore, assume that the predictor in Newcomb’s problem can only predict the probability distribution and not the sampled action, and that he fills box B (with the million dollars) with the probability the agent assigns to one-boxing. If this agent plays many instances of Newcomb’s problem, then she will ceteris paribus fare better in rounds in which she two-boxes. By LoE, she may therefore update toward two-boxing being the better option and consequently two-box with higher probability. Throughout the rest of this post, I will expound on the “goofiness” of this application of LoE.

Notice that this is not the only possible way to apply LoE. Indeed, the more natural way seems to be to apply LoE only to whatever entity the agent actually has the power to choose, rather than to something that is merely influenced by that choice. Here, that is the probability distribution, not the action sampled from it. Applied at the level of the probability distribution, LoE again leads to EDT. For example, in Newcomb’s problem the agent receives more money in rounds in which it chooses a higher probability of one-boxing. Let’s call this version of LoE “standard LoE”. We will call other versions, in which the choice is updated to bring some other variable (in this case the physical action) to assume values that are associated with high payoffs, “non-standard LoE”.
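To see the difference in a sketch (again, my own illustration), compare what each version of LoE “sees” in the randomized setup, where the agent’s choice is the probability p of one-boxing and the predictor fills box B with probability p:

```python
# What standard LoE responds to: the payoff associated with the chosen probability p.
# The expected payoff increases in p, so standard LoE pushes p toward 1 (one-boxing, EDT-like).
def expected_payoff_of_choice(p):
    return p * 1_000_000 + (1 - p) * 1_000  # E[box B] + P(two-box) * $1,000

# What non-standard LoE responds to: the comparison between the two physical actions within
# a round. Holding the (already determined) contents of box B fixed, two-boxing always pays
# $1,000 more, so non-standard LoE pushes p toward 0 (two-boxing, CDT-like).
def within_round_gain_from_two_boxing():
    return 1_000

print(expected_payoff_of_choice(0.9) > expected_payoff_of_choice(0.1))  # True
```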

Although non-standard LoE yields CDT-ish behavior in Newcomb’s problem, it can easily be criticized on causalist grounds. Consider a non-Newcomblike variant of Newcomb’s problem in which there is no predictor but merely an entity that reads the agent’s mind and fills box B with a million dollars in causal dependence on the probability distribution chosen by the agent. The causal graph representing this decision problem is given below, with the subject of choice marked red. Unless they are equipped with an incomplete model of the world (one that doesn’t include the probability distribution step), CDT and EDT agree that one should choose the probability distribution over actions that one-boxes with probability 1 in this variant of Newcomb’s problem. After all, choosing that probability distribution causes the game master to see that you will probably one-box and thus also causes him to put money under box B. But if you play this alternative version of Newcomb’s problem and use LoE on the level of one- versus two-boxing, then you would converge on two-boxing because, again, you will fare better in rounds in which you happen to two-box.

[Figure: causal graph of the mind-reading variant of Newcomb’s problem, with the chosen probability distribution (the subject of choice) marked red. (RandomizationBlogPost.jpg)]

Be it in Newcomb’s original problem or in this variant of Newcomb’s problem, non-standard LoE can lead to learning processes that don’t seem to match LoE’s “spirit”. When you apply standard LoE (and probably also in most cases of applying non-standard LoE), you develop a tendency to exhibit rewarded choices, and this will lead to more reward in the future. But if you adjust your choices with some intermediate variable in mind, you may get worse and worse. For instance, in either the regular or non-Newcomblike Newcomb’s problem, non-standard LoE adjusts the choice (the probability distribution over actions) so that the (physically implemented) action is more likely to be the one associated with higher reward (two-boxing), but the choice itself (high probability of two-boxing) will be one that is associated with low rewards. Thus, learning according to non-standard LoE can lead to decreasing rewards (in both Newcomblike and non-Newcomblike problems).

All in all, what I call non-standard LoE looks a bit like a hack rather than some systematic, sound version of CDT learning.

As a side note, the sensitivity to the details of how LoE is set up relative to randomization shows that the decision theory (CDT versus EDT versus something else) implied by some agent design can sometimes be very fragile. I originally thought that there would generally be some correspondence between agent designs and decision theories, such that changing the decision theory implemented by an agent usually requires large-scale changes to the agent’s architecture. But switching from standard LoE to non-standard LoE is an example where what seems like a relatively small change can significantly change the resulting behavior in Newcomb-like problems. Randomization in decision markets is another such example. (And the Gödel machine is yet another example, albeit one that seems less relevant in practice.)

Acknowledgements

I thank Lukas Gloor, Tobias Baumann and Max Daniel for advance comments. This work was funded by the Foundational Research Institute (now the Center on Long-Term Risk).

Pearl on causality

Here’s a quote by Judea Pearl (from p. 419f. of the Epilogue of the second edition of Causality) that, in light of his other writing on the topic, I found surprising when I first read it:

Let us examine how the surgery interpretation resolves Russell’s enigma concerning the clash between the directionality of causal relations and the symmetry of physical equations. The equations of physics are indeed symmetrical, but when we compare the phrases “A causes B” versus “B causes A,” we are not talking about a single set of equations. Rather, we are comparing two world models, represented by two different sets of equations: one in which the equation for A is surgically removed; the other where the equation for B is removed. Russell would probably stop us at this point and ask: “How can you talk about two world models when in fact there is only one world model, given by all the equations of physics put together?” The answer is: yes. If you wish to include the entire universe in the model, causality disappears because interventions disappear – the manipulator and the manipulated lose their distinction. However, scientists rarely consider the entirety of the universe as an object of investigation. In most cases the scientist carves a piece from the universe and proclaims that piece in – namely, the focus of investigation. The rest of the universe is then considered out or background and is summarized by what we call boundary conditions. This choice of ins and outs creates asymmetry in the way we look at things, and it is this asymmetry that permits us to talk about “outside intervention” and hence about causality and cause-effect directionality.