An analogy for the simulation argument

To people already familiar with the argument, this may be old news. It’s an analogy to help people feel the intuitive force of the argument.

Suppose the following: New diseases appear inside people all the time. Only one in ten diseases is contagious, but the contagious diseases spread to a thousand other people, on average, before dying out. The non-contagious ones don’t spread to anyone else. You get sick. You haven’t yet gone to the doctor or googled your symptoms; given the above facts, part of you wonders whether you should bother… after all, 90% of diseases only ever infect one person. So, how likely is it that someone else has had your disease before?

Draw a Venn diagram with two circles: “First victim” and “contagious.” This makes three interesting categories:

A = First victims of noncontagious diseases

B = First victims of contagious diseases

C = Nonfirst victims of contagious diseases

Since only one in ten diseases is contagious, B/A = 1/9. Since contagious diseases spread to a thousand other people on average, B/C = 1/1000. So for every 9 people in A there is 1 person in B and 1000 people in C, and it follows that C/(A+B+C) = 1000/1010 = 100/101.

Since you know you are diseased, you should assign credence to each hypothesis (“I’m in A,” “I’m in B,” “I’m in C”) equal to the fraction of diseased people who are in that category, at least until you get evidence that you would be more likely to get in one category than another.

(For example, if you google your symptoms and get zero hits, that’s evidence that is more likely to happen to people in group A or B, so you should update to be less confident that you are in group C than the base rate would suggest.)

So you are almost certain to be in category C, since they make up 100/101 of the total population of diseased people, and at the moment you don’t have any evidence that is less likely to happen to people in category C than people in the other categories. So it’s almost certain that other people have had your disease already, even though 90% of diseases only ever infect one person.
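
To make the arithmetic above easy to check, here is a minimal sketch in Python. The function name `category_shares` and its parameters are my own labels, not part of the original argument; it simply counts people per newly appearing disease and normalizes.

```python
from fractions import Fraction

def category_shares(contagious_fraction, spread_per_contagious):
    """Return the shares of diseased people in categories A, B, and C.

    Counts are per newly appearing disease:
    A = first victims of noncontagious diseases,
    B = first victims of contagious diseases,
    C = nonfirst victims of contagious diseases.
    """
    a = 1 - contagious_fraction
    b = contagious_fraction
    c = contagious_fraction * spread_per_contagious
    total = a + b + c
    return a / total, b / total, c / total

# One in ten diseases is contagious; each spreads to 1000 other people on average.
share_a, share_b, share_c = category_shares(Fraction(1, 10), 1000)
print(share_a, share_b, share_c)  # 9/1010 1/1010 100/101
```

Plugging in other numbers shows how robust the conclusion is: as long as the contagious diseases spread widely, category C dominates even when contagious diseases are rare.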

In the simulation argument, the reasoning is similar.

Suppose that 90% of non-simulated civilizations never create ancestor sims, but that the remaining 10% create 1000 each on average. Then most civilizations are ancestor-simulated.

Do we have any evidence that is more likely to happen to people who are ancestor-simulated than people who are not, or vice versa?

You might think the answer is yes: For example, simulated people are more likely to see things that appear to break the laws of physics, more likely to see pop-up windows saying “you are in a simulation,” etc. 

But Bostrom anticipates this when he restricts his argument to ancestor simulations. Ancestor simulations are designed to perfectly mimic real history, so the answer is no: Someone in category C is just as likely to see what we see as someone in category B or A.

So even if we rule out non-ancestor-simulations entirely, the argument goes through: Most civilizations with evidence like ours are ancestor-simulated, and we don’t have any reason to think we are special, so we are probably ancestor-simulated. And of course, we shouldn’t rule out non-ancestor-simulations entirely, so the probability that we are simulated should be even higher than the probability we are ancestor-simulated.

(Clarificatory caveat: Bostrom’s argument is NOT for the conclusion that we are in a simulation; rather, it argues for a three-way disjunction, of which “we are in a simulation” is only one disjunct. In the framing discussed here, Bostrom leaves open the possibility that Group A is extremely large relative to Group B, large enough that it is bigger than B and C combined.)



Great Map of the Mind

We have all these theories and debates about parts of the mind; why not make a big map to show how they all fit together? 

Obviously, minds aren’t all the same. But having a map like this helps us organize our thoughts, to better understand our own minds and the minds we are trying to design and reason about. I’d love to see better versions, or elaborations of this one, or entirely different mind-designs.

Here is the link in case you want to look at it more closely or spin off a copy to make a version of your own.

Moral realism and AI alignment

Abstract: Some have claimed that moral realism – roughly, the claim that moral claims can be true or false – would, if true, have implications for AI alignment research, such that moral realists might approach AI alignment differently than moral anti-realists. In this post, I briefly discuss different versions of moral realism based on what they imply about AI. I then go on to argue that pursuing moral-realism-inspired AI alignment would bypass the philosophical disagreements related to moral realism and help resolve the non-philosophical ones. Hence, even from a non-realist perspective, it is desirable that moral realists (and others who understand the relevant realist perspectives well enough) pursue moral-realism-inspired AI alignment research.

Different forms of moral realism and their implications for AI alignment

Roughly, moral realism is the view that “moral claims do purport to report facts and are true if they get the facts right.” So for instance, most moral realists would hold the statement “one shouldn’t torture babies” to be true. Importantly, this moral claim is different from a claim about baby torturing being instrumentally bad given some other goal (a.k.a. a “hypothetical imperative”) such as “if one doesn’t want to land in jail, one shouldn’t torture babies.” It is uncontroversial that such claims can be true or false. Moral claims, as I understand them in this post, are also different from descriptive claims about some people’s moral views, such as “most Croatians are against babies being tortured” or “I am against babies being tortured and will act accordingly”. More generally, the versions of moral realism discussed here claim that moral truth is in some sense mind-independent. It’s not so obvious what it means for a moral claim to be true or false, so there are many different versions of moral realism. I won’t go into more detail here, though we will revisit differences between different versions of moral realism later. For a general introduction on moral realism and meta-ethics, see, e.g., the SEP article on moral realism.

I should note right here that I myself find at least “strong versions” of moral realism implausible. But in this post, I don’t want to argue about meta-ethics. Instead, I would like to discuss an implication of some versions of moral realism. I will later say more about why I am interested in the implications of a view I believe to be misguided, but for now suffice it to say that “moral realism” is a majority view among professional philosophers (though I don’t know how popular the versions of moral realism studied in this post are), which makes it interesting to explore the view’s possible implications.

The implication that I am interested in here is that moral realism helps with AI alignment in some way. One very strong version of the idea is that the orthogonality thesis is false: if there is a moral truth, agents (e.g., AIs) that are able to reason successfully about a lot of non-moral things will automatically be able to reason correctly about morality as well and will then do what they infer to be morally correct. On p. 176 of “The Most Good You Can Do”, Peter Singer defends such a view: “If there is any validity in the argument presented in chapter 8, that beings with highly developed capacities for reasoning are better able to take an impartial ethical stance, then there is some reason to believe that, even without any special effort on our part, superintelligent beings, whether biological or mechanical, will do the most good they possibly can.” In the articles “My Childhood Death Spiral”, “A Prodigy of Refutation” and “The Sheer Folly of Callow Youth” (among others), Eliezer Yudkowsky says that he used to hold such a view.

Of course, current AI techniques do not seem to automatically include moral reasoning. For instance, if you develop an automated theorem prover to reason about mathematics, it will not be able to derive “moral theorems”. Similarly, if you use the Sarsa algorithm to train some agent with some given reward function, that agent will adapt its behavior in a way that increases its cumulative reward regardless of whether doing so conflicts with some ethical imperative. The moral realist would thus have to argue that in order to get to AGI or superintelligence or some other milestone, we will necessarily have to develop new and very different reasoning algorithms and that these algorithms will necessarily incorporate ethical reasoning. Peter Singer doesn’t state this explicitly. However, he makes a similar argument about human evolution on p. 86f. in ch. 8:
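
To illustrate the point about Sarsa, here is a minimal sketch of tabular Sarsa on a toy task with a single state and two actions. The task, the action names and the reward numbers are all made up for illustration. The only thing the algorithm ever sees is the reward; the label "harmful" is just a string it never inspects.

```python
import random

def train_sarsa(rewards, steps=5000, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Sarsa on a toy single-state, continuing task.

    `rewards` maps each action name to its (deterministic) reward.
    Returns the learned action values.
    """
    rng = random.Random(seed)
    actions = list(rewards)
    q = {a: 0.0 for a in actions}

    def choose():
        # Epsilon-greedy: mostly exploit current estimates, sometimes explore.
        if rng.random() < epsilon:
            return rng.choice(actions)
        return max(actions, key=q.get)

    action = choose()
    for _ in range(steps):
        reward = rewards[action]
        next_action = choose()
        # Sarsa update: bootstrap from the action actually chosen next.
        q[action] += alpha * (reward + gamma * q[next_action] - q[action])
        action = next_action
    return q

# Hypothetical reward function that happens to pay more for the harmful action.
q_values = train_sarsa({"benign": 0.0, "harmful": 1.0})
greedy_action = max(q_values, key=q_values.get)
print(q_values, greedy_action)
```

In this sketch the agent ends up preferring whichever action the reward function favors. Nothing in the update rule gives an ethical consideration any way to enter unless the designer encodes it in the rewards.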

The possibility that our capacity to reason can play a critical role in a decision to live ethically offers a solution to the perplexing problem that effective altruism would otherwise pose for evolutionary theory. There is no difficulty in explaining why evolution would select for a capacity to reason: that capacity enables us to solve a variety of problems, for example, to find food or suitable partners for reproduction or other forms of cooperative activity, to avoid predators, and to outwit our enemies. If our capacity to reason also enables us to see that the good of others is, from a more universal perspective, as important as our own good, then we have an explanation for why effective altruists act in accordance with such principles. Like our ability to do higher mathematics, this use of reason to recognize fundamental moral truths would be a by-product of another trait or ability that was selected for because it enhanced our reproductive fitness—something that in evolutionary theory is known as a spandrel.

A slightly weaker variant of this strong convergence moral realism is the following: Not all superintelligent beings would be able to identify or follow moral truths. However, if we add some feature that is not directly normative, then superintelligent beings would automatically identify the moral truth. For example, David Pearce appears to claim that “the pain-pleasure axis discloses the world’s inbuilt metric of (dis)value” and that therefore any superintelligent being that can feel pain and pleasure will automatically become a utilitarian. At the same time, that moral realist could believe that a non-conscious AI would not necessarily become a utilitarian. So, this slightly weaker variant of strong convergence moral realism would be consistent with the orthogonality thesis.

I find all of these strong convergence moral realisms very implausible. Especially given how current techniques in AI work – how value-neutral they are – the claim that algorithms for AGI will all automatically incorporate the same moral sense seems extraordinary and I have seen little evidence for it [1] (though I should note that I have read only bits and pieces of the moral realism literature). [2]

It even seems easy to come up with semi-rigorous arguments against strong convergence moral realism. Roughly, it seems that we can use a moral AI to build an immoral AI. Here is a simple example of such an argument. Imagine we had an AI system that (given its computational constraints) always chooses the most moral action. Now, it seems that we could construct an immoral AI system using the following algorithm: Use the moral AI to decide which action of the immoral AI system it would prevent from being taken if it could only choose one action to be prevented. Then take that action. There is a gap in this argument: perhaps the moral AI simply refuses to choose the moral actions in “prevention” decision problems, reasoning that it might currently be used to power an immoral AI. (If exploiting a moral AI was the only way to build other AIs, then this might be the rational thing to do as there might be more exploitation attempts than real prevention scenarios.) Still (without having thought about it too much), it seems likely to me that a more elaborate version of such an argument could succeed.
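
The construction can be sketched in a few lines of Python. This is only an illustration under strong assumptions: `moral_ai` stands in for the hypothetical system that always picks the most moral option, morality is represented by a made-up numeric `moral_score`, and I assume that in a "prevention" problem the most moral choice is to prevent the lowest-scoring action. The sketch also ignores the gap discussed above, i.e. it assumes the moral AI answers prevention questions honestly.

```python
def moral_ai(options, moral_score):
    """Stand-in for a system that always picks the most moral option."""
    return max(options, key=moral_score)

def immoral_ai(actions, moral_score):
    """Ask the moral AI which single action it would prevent, then take it."""
    # Assumption: preventing an action is moral in proportion to how bad
    # that action is, so the prevention problem negates the score.
    def prevention_score(action):
        return -moral_score(action)
    action_to_prevent = moral_ai(actions, prevention_score)
    return action_to_prevent

# Made-up actions and scores, purely for illustration.
scores = {"donate": 2, "do_nothing": 0, "steal": -3}
actions = list(scores)

print(moral_ai(actions, scores.get))    # donate
print(immoral_ai(actions, scores.get))  # steal
```

The immoral system needs no moral understanding of its own; it only wraps the moral one and inverts the question.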

Here’s a weaker moral realist convergence claim about AI alignment: There’s moral truth and we can program AIs to care about the moral truth. Perhaps it suffices to merely “tell them” to refer to the moral truth when deciding what to do. Or perhaps we would have to equip them with a dedicated “sense” for identifying moral truths. This version of moral realism again does not claim that the orthogonality thesis is wrong, i.e. that sufficiently effective AI systems will automatically behave ethically without us giving them any kind of moral guidance. It merely states that in addition to the straightforward approach of programming an AI to adopt some value system (such as utilitarianism), we could also program the AI to hold the correct moral system. Since pointing at something that exists in the world is often easier than describing that thing, it might be thought that this alternative approach to value loading is easier than the more direct one.

I haven’t found anyone who defends this view (I haven’t looked much), but non-realist Brian Tomasik gives this version of moral realism as a reason to discuss moral realism:

Moral realism is a fun philosophical topic that inevitably generates heated debates. But does it matter for practical purposes? […] One case where moral realism seems problematic is regarding superintelligence. Sometimes it’s argued that advanced artificial intelligence, in light of its superior cognitive faculties, will have a better understanding of moral truth than we do. As a result, if it’s programmed to care about moral truth, the future will go well. If one rejects the idea of moral truth, this quixotic assumption is nonsense and could lead to dangerous outcomes if taken for granted.

(Below, I will argue that there might be no reason to be afraid of moral realists. However, my argument will, like Brian’s, also imply that moral realism is worth debating in the context of AI.)

As an example, consider a moral realist view according to which moral truth is similar to mathematical truth: there are some axioms of morality which are true (for reasons I, as a non-realist, do not understand or agree with) and together these axioms imply some moral theory X. This moral realist view suggests an approach to AI alignment: program the AI to abide by these axioms (in the same way as we can have automated theorem provers assume some set of mathematical axioms to be true). It seems clear that something along these lines could work. However, this approach’s reliance on moral realism is also much weaker.

As a second example, divine command theory states that moral truth is determined by God’s will (again, I don’t see why this should be true and how it could possibly be justified). A divine command theorist might therefore want to program the AI to do whatever God wants it to do.

Here are some more such theories:

  • Social contract
  • Habermas’ discourse ethics
  • Universalizability / Kant’s categorical imperative
  • Applying human intuition

Besides pointing being easier than describing, another potential advantage of such a moral realist approach might be that one is more confident in one’s meta-ethical view (“the pointer”) than in one’s object-level moral system (“one’s own description”). For example, someone could be confident that moral truth is determined by God’s will but be unsure whether God’s will is expressed via the Bible, the Quran or something else, or how these religious texts are to be understood. Then that person would probably favor AI that cares about God’s will over AI that follows some particular interpretation of, say, the moral rules proposed in the Quran and Sharia.

A somewhat related issue which has received more attention in the moral realism literature is the convergence of human moral views. People have given moral realism as an explanation for why there is near-universal agreement on some ethical views (such as “when religion and tradition do not require otherwise, one shouldn’t torture babies”). Similarly, moral realism has been associated with moral progress in human societies, see, e.g., Huemer (2016). At the same time, people have used the existence of persisting and unresolvable moral disagreements (see, e.g., Bennigson 1996 and Sayre-McCord 2017, sect. 1) and the existence of gravely immoral behavior in some intelligent people (see, e.g., Nichols 2002) as arguments against moral realism. Of course, all of these arguments take moral realism to include a convergence thesis where being a human (and perhaps not being affected by some mental disorders) or being a society of humans is sufficient to grasp and abide by moral truth.

Of course, there are also versions of moral realism that have even weaker (or just very different) implications for AI alignment and do not make any relevant convergence claims (cf. McGrath 2010). For instance, there may be moral realists who believe that there is a moral truth but that machines are in principle incapable of finding out what it is. Some may also call very different views “moral realism”, e.g. claims that given some moral imperative, it can be decided whether an action does or does not comply with that imperative. (We might call this “hypothetical imperative realism”.) Or “linguistic” versions of moral realism which merely make claims about the meaning of moral statements as intended by whoever utters these moral statements. (Cf. Lukas Gloor’s post on how different versions of moral realism differ drastically in terms of how consequential they are.) Or a kind of “subjectivist realism”, which drops mind-independence (cf. Olson 2014, ch. 2).

Why moral-realism-inspired research on AI alignment might be useful

I can think of many reasons why moral realism-based approaches to AI safety have not been pursued much: AI researchers often do not have a sufficiently high awareness of or interest in philosophical ideas; the AI safety researchers who do – such as researchers at MIRI – tend to reject moral realism, at least the versions with implications for AI alignment; although “moral realism” is popular among philosophers, versions of moral realism with strong implications for AI (à la Peter Singer or David Pearce) might be unpopular even among philosophers (cf. again Lukas’ post on how different versions of moral realism differ drastically in terms of how consequential they are); and so on…

But why am I now proposing to conduct such research, given that I am not a moral realist myself? The main reason (besides some weaker reasons like pluralism and keeping this blog interesting) is that I believe AI alignment research from a moral realist perspective might actually increase agreement between moral realists and anti-realists about how (and to what extent) AI alignment research should be done. In the following, I will briefly argue this case for the strong (à la Peter Singer and David Pearce) and the weak convergence versions of moral realism outlined above.

Strong versions

Like most problems in philosophy, the question of whether moral realism is true lacks an accepted truth condition or an accepted way of verifying an answer or an argument for either realism or anti-realism. This is what makes these problems so puzzling and intractable. This is in contrast to problems in mathematics where it is pretty clear what counts as a proof of a hypothesis. (This is, of course, not to say that mathematics involves no creativity or that there are no general purpose “tools” for philosophy.) However, the claim made by strong convergence moral realism is more like a mathematical claim. Although it is yet to be made precise, we can easily imagine a mathematical (or computer-scientific) hypothesis stating something like this: “For any goal X of some kind [namely the objectively incorrect and non-trivial-to-achieve kind] there is no efficient algorithm that when implemented in a robot achieves X in some class of environments. So, for instance, it is in principle impossible to build a robot that turns Earth into a pile of paperclips.” It may still be hard to formalize such a claim and mathematical claims can still be hard to prove or disprove. But determining the truth of a mathematical statement is no longer a philosophical problem. If someone lays out a mathematical proof or disproof of such a claim, any reasonable person’s opinion would be swayed. Hence, I believe that work on proving or disproving this strong version of moral realism will lead to (more) agreement on whether the “strong-moral-realism-based theory of AI alignment” is true.

It is worth noting that finding out whether strong convergence is true may not resolve metaphysical issues. Of course, all strong versions of moral realism would turn out false if the strong convergence hypothesis were falsified. But other versions of moral realism would survive. Conversely, if the strong convergence hypothesis turned out to be true, then anti-realists may remain anti-realists (cf. footnote 2). But if our goal is to make AI moral, the convergence question is much more important than the metaphysical question. (That said, for some people the metaphysical question has a bearing on whether they have preferences over AI systems’ motivation system – “if no moral view is more true than any other, why should I care about what AI systems do?”)

Weak versions

Weak convergence versions of moral realism do not make such in-principle-testable predictions. Their only claim is the metaphysical view that the goals identified by some method X (such as derivation from a set of moral axioms, finding out what God wants, discourse, etc.) have some relation to moral truths. Thinking about weak convergence moral realism from the more technical AI alignment perspective is therefore unlikely to resolve disagreements about whether some versions of weak convergence moral realism are true. However, I believe that by not making testable predictions, weak convergence versions of moral realism are also unlikely to lead to disagreement about how to achieve AI alignment.

Imagine moral realists were to propose that AI systems should reason about morality according to some method X on the basis that the result of applying X is the moral truth. Then moral anti-realists could agree with the proposal on the basis that they (mostly) agree with the results of applying method X. Indeed, for any moral theory with realist ambitions, ridding that theory of these ambitions yields a new theory which an anti-realist could defend. As an example, consider Habermas’ discourse ethics and Yudkowsky’s Coherent Extrapolated Volition. The two approaches to justifying moral views seem quite similar – roughly: do what everyone would agree with if they were exposed to more arguments. But Habermas’ theory explicitly claims to be realist while Yudkowsky is a moral anti-realist, as far as I can tell.

In principle, it could be that moral realists defend some moral view on the grounds that it is true even if it seems implausible to others. But here’s a general argument for why this is unlikely to happen. You cannot directly perceive ought statements (David Pearce and others would probably disagree) and it is easy to show that you cannot derive a statement containing an ought without using other statements containing an ought or inference rules that can be used to introduce statements containing an ought. Thus, if moral realism (as I understand it for the purpose of this paper) is true, there must be some moral axioms or inference rules that are true without needing further justification, similar to how some people view the axioms of Peano arithmetic or Euclidean geometry. An example of such a moral rule could be (a formal version of) “pain is bad”. But if these rules are “true without needing further justification”, then they are probably appealing to anti-realists as well. Of course, anti-realists wouldn’t see them as deserving the label of “truth” (or “falsehood”), but assuming that realists and anti-realists have similar moral intuitions, anything that a realist would call “true without needing further justification” should also be appealing to a moral anti-realist.

As I have argued elsewhere, it’s unlikely we will ever come up with (formal) axioms (or methods, etc.) for morality that would be widely accepted by the people of today (or even among today’s Westerners with secular ethics). But I still think it’s worth a try. If it doesn’t work out, weak convergence moral realists might come around to other approaches to AI alignment, e.g. ones based on extrapolating from human intuition.

Other realist positions

Besides realism about morality, there are many other less commonly discussed realist positions, for instance, realism about which prior probability distribution to use, whether to choose according to some expected value maximization principle (and if so which one), etc. The above considerations apply to these other realist positions as well.


1. There are some “universal instrumental goal” approaches to justifying morality. Some are based on cooperation and work roughly like this: “Whatever your intrinsic goals are, it is often better to be nice to others so that they reciprocate. That’s what morality is.” I think such theories fail for two reasons: First, there seem to be many widely accepted moral imperatives that cannot be fully justified by cooperation. For example, we usually consider it wrong for dictators to secretly torture and kill people, even if doing so has no negative consequences for them. Second, being nice to others because one hopes that they reciprocate is not, I think, what morality is about. To the contrary, I think morality is about caring about things (such as other people’s welfare) intrinsically. I discuss this issue in detail with a focus on so-called “superrational cooperation” in chapter 6.7 of “Multiverse-wide Cooperation via Correlated Decision Making”. Another “universal instrumental goal” approach is the following: If there is at least one god, then not making these gods angry at you may be another universal instrumental goal, so whatever an agent’s intrinsic goal is, it will also act according to what the gods want. The same “this is not what morality is about” argument seems to apply.

2. Yudkowsky has written about why he now rejects this form of moral realism in the first couple of blog posts in the “Value Theory” series.

Goertzel’s GOLEM implements evidential decision theory applied to policy choice

I’ve written about the question of which decision theories describe the behavior of approaches to AI like the “Law of Effect”. In this post, I would like to discuss GOLEM, an architecture for a self-modifying artificial intelligence agent described by Ben Goertzel (2010; 2012). Goertzel calls it a “meta-architecture” because all of the intelligent work of the system is done by sub-programs that the architecture assumes as given, such as a program synthesis module (cf. Kaiser 2007).

Roughly, the top-level self-modification is done as follows. For any proposal for a (partial) self-modification, i.e. a new program to replace (part of) the current one, the “Predictor” module predicts how well that program would achieve the goal of the system. Another part of the system — the “Searcher” — then tries to find programs that the Predictor deems superior to the current program. So, at the top level, GOLEM chooses programs according to some form of expected value calculated by the Predictor. The first interesting decision-theoretical statement about GOLEM is therefore that it chooses policies — or, more precisely, programs — rather than individual actions. Thus, it would probably give the money in at least some versions of counterfactual mugging. This is not too surprising, because it is unclear on what basis one should choose individual actions when the effectiveness of an action depends on the agent’s decisions in other situations.
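
As a rough illustration of this top-level loop (not Goertzel’s specification), here is a minimal Python sketch. The function names `self_modify`, `propose_candidates` and `predict_value` are my own: the first argument plays the role of the Searcher’s candidate generation and the second the role of the Predictor. In the toy usage, “programs” are just integers.

```python
def self_modify(current_program, propose_candidates, predict_value, rounds=10):
    """Repeatedly replace the current program with a candidate that the
    predictor rates higher, if the searcher finds one."""
    for _ in range(rounds):
        for candidate in propose_candidates(current_program):
            if predict_value(candidate) > predict_value(current_program):
                current_program = candidate
    return current_program

# Toy stand-ins: a "program" is an integer, and the predictor believes
# that the program 7 achieves the goal best.
def toy_searcher(program):
    return [program - 1, program + 1]

def toy_predictor(program):
    return -(program - 7) ** 2

final_program = self_modify(0, toy_searcher, toy_predictor)
print(final_program)  # 7
```

The sketch makes the decision-theoretic point visible: what gets compared and selected are whole programs (policies), never individual actions.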

The next natural question to ask is, of course, what expected value (causal, evidential or other) the Predictor computes. Like the other aspects of GOLEM, the Predictor is subject to modification. Hence, we need to ask according to what criteria it is updated. The criterion is provided by the Tester, a “hard-wired program that estimates the quality of a candidate Predictor” based on “how well a Predictor would have performed in the past” (Goertzel 2010, p. 4). I take this to mean that the Predictor is judged based on the extent to which it is able to predict the things that actually happened in the past. For instance, imagine that at some time in the past the GOLEM agent self-modified to a program that one-boxes in Newcomb’s problem. Later, the agent actually faced a Newcomb problem based on a prediction that was made before the agent self-modified into a one-boxer and won a million dollars. Then the Predictor should be able to predict that self-modifying to one-boxing in this case “yielded” getting a million dollars even though it did not do so causally. More generally, to maximize the score from the Tester, the Predictor has to compute regular (evidential) conditional probabilities and expected utilities. Hence, it seems that the EV computed by the Predictor is a regular EDT-ish one. This is not too surprising, either, because as we have seen before, it is much more common for learning algorithms to implement EDT, especially if they implement something which looks like the Law of Effect.

In conclusion, GOLEM learns to choose policy programs based on their EDT-expected value.

Acknowledgements

This post is based on a discussion with Linda Linsefors, Joar Skalse, and James Bell.

Three wagers for multiverse-wide superrationality

In this post, I outline three wagers in favor of the hypothesis that multiverse-wide superrationality (MSR) has action-guiding implications. MSR is based on three core assumptions:

  1. There is a large or infinite universe or multiverse.
  2. Agents should apply an acausal decision theory.
  3. An agent’s actions provide evidence about the actions of other, non-identical agents with different goals in other parts of the universe.

There are three wagers corresponding to these three assumptions. The wagers work only with those value systems that can also benefit from MSR (for instance, with total utilitarianism) (see Oesterheld, 2017, sec. 3.2). I assume such a value system in this post. I am currently working on a longer paper about a wager for (ii), which will discuss the premises for this wager in more detail.

A wager for acausal decision theory and a large universe

If this universe is very large or infinite, then it is likely that there is an identical copy of the part of the universe that is occupied by humans somewhere far away in space (Tegmark 2003, p. 464). Moreover, there will be vastly many or infinitely many such copies. Hence, for example, if an agent prevents a small amount of suffering on Earth, this will be accompanied by many copies doing the same, resulting in many times that amount of suffering averted throughout the universe.

Assuming causal decision theory (CDT), the impact of an agent’s copies is not taken into account when making decisions—there is an evidential dependence between the agent’s actions and the actions of their copies, but no causal influence. According to evidential decision theory (EDT), on the other hand, an agent should take such dependences into account when evaluating different choices. For EDT, a choice between two actions on Earth is also a choice between the actions of all copies throughout the universe. The same holds for all other acausal decision theories (i.e., decision theories that take such evidential dependences into account): for instance, for the decision theories developed by MIRI researchers (such as functional decision theory (Yudkowsky and Soares, 2017)), and for Poellinger’s variation of CDT (Poellinger, 2013).

Each of these considerations on its own would not be able to get a wager off the ground. But jointly, they can do so: on the one hand, given a large universe, acausal decision theories will claim a much larger impact with each action than causal decision theory does. Hence, there is a wager in favor of these acausal decision theories. Suppose an agent applies some meta decision theory (see MacAskill, 2016, sec. 2) that aggregates the expected utilities provided by individual decision theories. Even if the agent assigns a small credence to acausal decision theories, these theories will still dominate the meta decision theory’s expected utilities. On the other hand, if an agent applies an acausal decision theory, they can have a much higher impact in a large universe than in a small universe. The agent should thus always act as if the universe is large, even if they only assign a very small credence to this hypothesis.

In conclusion, most of an agent’s impact comes from applying an acausal decision theory in a large universe. Even if the agent assigns a small credence both to acausal decision theories and to the hypothesis that the universe is large, they should still act as if they placed a high credence in both.
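
The joint wager can be made concrete with a small numerical sketch. The credences, the number of copies, and the simple credence-weighted aggregation below are illustrative assumptions rather than values or a method taken from the cited sources; the function name is hypothetical.

```python
def meta_expected_value(credence_acausal, credence_large, copies, value_per_act=1.0):
    """Credence-weighted value of one act, split by (decision theory, universe size).

    Assumptions (hypothetical): a causal theory counts only the agent's own act;
    an acausal theory counts the act once per correlated copy, and such copies
    exist only if the universe is large.
    """
    contributions = {}
    for theory, p_theory in (("causal", 1 - credence_acausal), ("acausal", credence_acausal)):
        for size, p_size in (("small", 1 - credence_large), ("large", credence_large)):
            multiplier = copies if (theory == "acausal" and size == "large") else 1
            contributions[(theory, size)] = p_theory * p_size * multiplier * value_per_act
    return contributions


# Illustrative inputs: 1% credence in each hypothesis, a billion correlated copies.
contributions = meta_expected_value(credence_acausal=0.01, credence_large=0.01, copies=10**9)
total = sum(contributions.values())
share = contributions[("acausal", "large")] / total
print(f"share of expected value from the acausal/large-universe case: {share:.6f}")
```

With these made-up inputs, almost all of the aggregated expected value sits in the one cell where both hypotheses hold, which is the sense in which the agent should act as if both were true.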

A wager in favor of higher correlations

In explaining the third wager, it is important to note that I assume a subjective interpretation of probability. If I say that there is a correlation between the actions of two agents, I mean that, given one’s subjective beliefs, observing one agent’s action provides evidence about the other agent’s action. Moreover, I assume that agents are in a symmetrical decision situation—for instance, this is the case for two agents in a prisoner’s dilemma. If the decision situation is symmetrical, and if the agents are sufficiently similar, their actions will correlate. The theory of MSR says that agents in a large universe probably are in a symmetrical decision situation (Oesterheld, 2017, sec. 2.8).

There exists no general theory of correlations between different agents. It seems plausible to assume that a correlation between the actions of two agents must be based on a logical correlation between the decision algorithms that these two agents implement. But it is not clear how to think about the decision algorithms that humans implement, for instance, and how to decide whether two decision algorithms are functionally equivalent (Yudkowsky and Soares, 2017, sec. 3). There exist solutions to these problems only in some narrow domains, for instance, for agents represented by programs written in some specific programming language.

Hence, it is also not clear which agents’ actions in a large universe correlate, given that all are in a symmetrical decision situation. It could be that an agent’s actions correlate only with very close copies. If these copies thus share the same values as the agent, then MSR does not have any action-guiding consequences. The agent will just continue to pursue their original goal function. If, on the other hand, there are many correlating agents with different goals, then MSR has strong implications. In the latter case, there can be gains from trade between these agents’ different value systems.

Just as there is a wager for applying acausal decision theory in general, there is also a wager in favor of assuming that an agent’s actions correlate with more rather than fewer different agents. Suppose there are two hypotheses: (H1) Alice’s actions only correlate with the actions of (G1) completely identical copies of Alice, and (H2) Alice’s actions correlate with (G2) all other agents that ever gave serious consideration to MSR or some equivalent idea.

(In both cases, I assume that Alice has seriously considered MSR herself.) G1 is a subset of G2, and it is plausible that G2 is much larger than G1. Moreover, it is plausible that there are also agents with Alice’s values among the agents in G2 which are not also in G1. Suppose 1-p is Alice’s credence in H1, and p her credence in H2. Suppose further that there are n agents in G1 and m agents in G2, and that q is the fraction of agents in G2 sharing Alice’s values. All agents have the choice between (A1) only pursuing their own values, and (A2) pursuing the sum over the values of all agents in G2. Choosing A1 gives an agent 1 utilon. Suppose g denotes the possible gains from trade; that is, choosing A2 produces (1+g)×s utilons for each value system, where s is the fraction of agents in G2 supporting that value system. If everyone in G2 chooses A2, this produces (1+g)×q×m utilons for Alice’s value system, while, if everyone chooses A1, this produces only q×m utilons in total for Alice.

The decision situation for Alice can be summarized by the following choice matrix (assuming, for simplicity, that all correlations are perfect):

        H1                H2
A1      n+c               q×m
A2      (1+g)×q×n+c       (1+g)×q×m

Here, the cells denote the expected utilities that EDT assigns to either of Alice’s actions given either H1 or H2. c is a constant that denotes the expected value generated by the agents in G2 that are non-identical to Alice, given H1. It plays no role in comparing A1 and A2, since, given H1, these agents are not correlated with Alice: the value will be generated no matter which action she picks. The value for H1∧A2 is unrealistically high, since it supposes the same gains from trade as H2∧A2, but this does not matter here. According to EDT, Alice should choose A2 over A1 iff

g×p×q×m > (1-p)×n - (1+g)×(1-p)×n×q.

It seems likely that q×m is larger than n—the requirement that an agent must be a copy of Alice restricts the space of agents more than that of having thought about MSR and sharing Alice’s values. Therefore, even if the gains from trade and Alice’s credence in H2 (i.e., g×p) are relatively small, g×p×q×m is still larger than n, and EDT recommends A2.
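
As a check on the comparison above, here is a minimal sketch that computes the credence-weighted expected utilities of A1 and A2 from the choice matrix and compares the result with the inequality. The numerical values are hypothetical and chosen only for illustration; the function names are not from the original.

```python
def edt_expected_utilities(p, g, q, n, m, c=0.0):
    """Return (EU(A1), EU(A2)) from the choice matrix, with credence p in H2."""
    eu_a1 = (1 - p) * (n + c) + p * (q * m)
    eu_a2 = (1 - p) * ((1 + g) * q * n + c) + p * ((1 + g) * q * m)
    return eu_a1, eu_a2


def prefers_a2(p, g, q, n, m):
    """The inequality: g*p*q*m > (1-p)*n - (1+g)*(1-p)*n*q."""
    return g * p * q * m > (1 - p) * n - (1 + g) * (1 - p) * n * q


# Hypothetical numbers: low credence in H2, modest gains from trade,
# but q*m much larger than n.
params = dict(p=0.05, g=0.1, q=0.01, n=1_000, m=10**9)
eu_a1, eu_a2 = edt_expected_utilities(c=123.0, **params)
print(eu_a2 > eu_a1, prefers_a2(**params))
```

The constant c appears in both expected utilities with the same weight, so it cancels in the comparison, which is why the inequality does not mention it.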

While the argument for this wager is not as strong as the argument for the first two wagers, it is still plausible. It is plausible that there are many more agents having thought about MSR and sharing a person’s values than there are identical copies of the person. Hence, if the person’s actions correlate with the actions of all the agents in the larger group, the person’s actions have a much higher impact. Moreover, in this case, they plausibly also correlate with the actions of many agents holding different values, allowing for gains from trade. Therefore, one should act as if there were more rather than fewer correlations, even if one assigns a rather low credence to that hypothesis.

Acknowledgements

I am grateful to Caspar Oesterheld and Max Daniel for helpful comments on a draft of this post.

A wager against Solomonoff induction

The universal prior assigns zero probability to non-computable universes—for instance, universes that could only be represented by Turing machines in which uncountably many locations need to be updated, or universes in which the halting problem is solved in physics. While such universes might very likely not exist, one cannot justify assigning literally zero credence to their existence. I argue that it is of overwhelming importance to make a potential AGI assign a non-zero credence to incomputable universes—in particular, universes with uncountably many “value locations”.

Here, I assume a model of universes as sets of value locations. Given a specific goal function, each element in such a set could specify an area in the universe with some finite value. If a structure contains a sub-structure, and both the structure and the sub-structure are valuable in their own regard, there could either be one or two elements representing this structure in the universe’s set of value locations. If a structure is made up of infinitely many sub-structures, all of which the goal function assigns some positive but finite value to, then this structure could (if the sum of values does not converge) possibly only be represented by infinitely many elements in the set. If the set of value locations representing a universe is countable, then the value of said universe could be the sum over the values of all elements in the set (granted that some ordering of the elements is specified). I write that a universe is “countable” if it can be represented by a finite or countably infinite set, and a universe is “uncountable” if it can only be represented by an uncountably infinite set.

A countable universe, for example, could be a regular cellular automaton. If the automaton has infinitely many cells, then, given a goal function such as total utilitarianism, the automaton could be represented by a countably infinite set of value locations. An uncountable universe, on the other hand, could be a cellular automaton in which there is a cell for each real number, and interactions between cells over time are specified by a mathematical function. Given some utility functions over such a universe, one might be able to represent the universe only by an uncountably infinite set of value locations. Importantly, even though the universe could be described in logic, it would be incomputable.

Depending on one’s approach to infinite ethics, an uncountable universe could matter much more than a countable universe. Agents in uncountable universes might, with comparatively small resource investments, be able to create (or prevent), for instance, amounts of happiness or suffering that could not be created in an entire countable universe. For instance, each cell in the abovementioned cellular automaton might consist of some (possibly valuable) structure in and of itself, and the cells’ structures might influence each other. Moreover, some (uncountable) set of cells might be regarded as an agent. The agent might then be able to create a positive amount of happiness in uncountably many cells, which (at least given some definitions of value and approaches to infinite ethics) would have created more value than could ever be created in a countable universe.

Therefore, there is a wager in favor of the hypothesis that humans actually live in an uncountable universe, even if it appears unlikely given current scientific evidence. But there is also a different wager, which applies if there is a chance that such a universe exists, regardless of whether humans live in that universe. It is unclear which of the two wagers dominates.

The second wager is based on acausal trade: there might be agents in an uncountable universe that do not benefit from the greater possibilities of their universe—e.g., because they do not care about the number of individual copies of some structure, but instead care about an integral over the structures’ values relative to some measure over structures. While agents in a countable universe might be able to benefit those agents equally well, they might be much worse at satisfying the values of agents with goals sensitive to the greater possibilities in uncountable universes. Thus, due to different comparative advantages, there could be great gains from trade between agents in countable and uncountable universes.

The above example might sound outlandish, and it might be flawed in that one could not actually come up with interaction rules that would lead to anything interesting happening in the cellular automaton. But this is irrelevant. It suffices that there is only the faintest possibility that an AGI could have an acausal impact in an incomputable universe which, according to one’s goal function, would outweigh all impact in all computable universes. There probably exists a possible universe like that for most goal functions. Therefore, one could be missing out on virtually all impact if the AGI employs Solomonoff induction.

There might not only be incomputable universes represented by a set that has the cardinality of the continuum, but there might be incomputable universes represented by sets of any cardinality. In the same way that there is a wager for the former, there is an even stronger wager for universes with even higher cardinalities. If there is a universe of highest cardinality, it appears to make sense to optimize only for acausal trade with that universe. Of course, there could be infinitely many different cardinalities, so one might hope that there is some convergence as to the values of the agents in universes of ever higher cardinalities (which might make it easier to trade with these agents).

In conclusion, there is a wager in favor of considering the possibility of incomputable universes: even a small acausal impact (relative to the total resources available) in an incomputable universe could counterbalance everything humans could do in a computable universe. Crucially, an AGI employing Solomonoff induction will not consider this possibility, hence potentially missing out on unimaginable amounts of value.

Acknowledgements

Caspar Oesterheld and I came up with the idea for this post in a conversation. I am grateful to Caspar Oesterheld and Max Daniel for helpful feedback on earlier drafts of this post.

UDT is “updateless” about its utility function

Updateless decision theory (UDT) (or some variant thereof) seems to be widely accepted as the best current solution to decision theory by MIRI researchers and LessWrong users. In this short post, I outline one potential implication of being completely updateless. My intention is not to refute UDT, but to show that:

  1. It is not clear how updateless one might want to be, as this could have unforeseen consequences.
  2. If one endorses UDT, one should also endorse superrational cooperation on a very deep level.

My argument is simple, and draws on the idea of multiverse-wide superrational cooperation (MSR), which is a form of acausal trade between agents with correlated decision algorithms. Thinking about MSR instead of general acausal trade has the advantage that it seems conceptually easier, while the conclusions gained should hold in the general case as well. Nevertheless, I am very uncertain and expect the reality of acausal cooperation between AIs to look different from the picture I draw in this post.

Suppose humans have created a friendly AI with a CEV utility function and UDT as its decision theory. This version of UDT has solved the problem of logical counterfactuals and algorithmic correlation, and can readily spot any correlated agent in the world. Such an AI will be inclined to trade acausally with other agents—agents in parts of the world it does not have causal access to. This is, for instance, to achieve gains from comparative advantages given empirical circumstances, and to exploit diminishing marginal returns of pursuing any single value system at once.

For the trade implied by MSR, the AI does not have to simulate other agents and engage in some kind of Löbian bargain with them. Instead, the AI has to find out whether the agents’ decision algorithms are functionally equivalent to the AI’s decision algorithm, it has to find out about the agents’ utility functions, and it has to make sure the agents are in an empirical situation such that trade benefits both parties in expectation. (Of course, to do this, the AI might also have to perform a simulation.) The easiest trading step seems to be the one with all other agents using updateless decision theory and the same prior. In this context, it is possible to neglect many of the usual obstacles to acausal trade. These agents share everything except their utility function, so there will be little if any “friction”—as long as the compromise takes differences between utility functions into account, the correlation between the agents will be perfect. It would get more complicated if the versions of UDT diverged a bit, and if the priors were slightly different. (More on this later.) I assume here that the agents can find out about the other agents’ utility functions. Although these are logically determined by the prior, the agents might be logically uncertain, and calculating the distribution of utility functions of UDT agents might be computationally expensive. I will ignore this consideration here.

A possible approach to this trade is to effectively choose policies based on a weighted sum of the utility functions of all UDT agents in all the possible worlds contained in the AI’s prior (see Oesterheld 2017, section 2.8 for further details). Here, the weights will be assigned such that in expectation, all agents will have an incentive to pursue this sum of utility functions. It is not exactly clear how such weights will be calculated, but it is likely that all agents will adopt the same weights, and it seems clear that once this weighting is done based on the prior, it won’t change after finding out which of the possible worlds from the prior is actual (Oesterheld 2017, section 2.8.6). If all agents adopt the policy of always pursuing a sum of their utility functions, the expected marginal additional goal fulfillment for all AIs at any point in the future will be highest. The agents will act according to the “greatest good for the greatest number.” Any individual agent won’t know whether they will benefit in reality, but that is irrelevant from the updateless perspective. This becomes clear if we compare the situation to thought experiments like the Counterfactual Mugging. Even if in the actual world, the AI cannot benefit from engaging in the compromise, then it was still worth it from the prior viewpoint, since (given sufficient weight in the sum of utility functions) the AI would have stood to gain even more in another, non-actual world.

If the agents are also logically updateless, this reduces the information the weights of the agents’ utility functions are based on. There probably are many logical implications that could be drawn from an empirical prior and the utility functions about aspects of the trade (e.g., that the trade will benefit only the most common utility functions, that some values won’t be pursued by anyone in practice, etc.) that might be one logical implication step away from a logical prior. If the AI is logically updateless, it will always perform the action that it would have committed to before it got to know about these implications. Of course, logical updatelessness is an unresolved issue, and its implications for MSR will depend on possible solutions to the problem.

In conclusion, in order to implement the MSR compromise, the AI will start looking for other UDT agents in all possible (and, possibly, impossible) worlds in its prior. It will find out about their utility functions and calculate a weighted sum over all of them. This is what I mean by the statement that UDT is “updateless” about its utility function: no matter what utility function it starts out with, its own function might still have negligible weight in the goals the UDT AI will pursue in practice. At this point, it becomes clear that it really matters what this prior looks like. What is the distribution of the utility functions of all UDT agents given the universal prior? There might be worlds less complex than the world humans live in—for instance, a cellular automaton, such as Rule 110 or Game of Life, with a relatively simple initial state—which still contain UDT agents. Given that these worlds might have a higher prior probability than the human world, they might get a higher weight in the compromise utility function. The AI might end up maximizing the goal functions of the agents in the simplest worlds.

Is updating on your existence a sin?

One of the features of UDT is that it does not even condition the prior on the agent’s own existence—when evaluating policies, UDT also considers their implications in worlds that do not contain an instantiation of the agent, even though by the time the agent thinks its first thought, it can be sure that these worlds do not exist. This might not be a problem if one assigns high weight to a modal realism/Tegmark Level 4 universe anyway. An observation can never distinguish between a world in which all worlds exist, and one in which only the world featuring the current observation exists. So if the measure of all the “single worlds” is small, then updating on existence won’t change much.

Suppose that this is not the case. Then there might be many worlds that can already be excluded as non-actual based on the fact that they don’t contain humans. Nevertheless, they might contain UDT agents with alien goals. This poses a difficult choice: Given UDT’s prior, the AI will still cooperate with agents living in non-actual (and impossible, if the AI is logically updateless) worlds. This is because given UDT’s prior, it could have been not humans, but these alien agents, that turned out actual, in which case they could have benefited humans in return. On the other hand, if the AI is allowed to condition on such information, then it loses in a kind of counterfactual prisoner’s dilemma:

  • Counterfactual prisoner’s dilemma: Omega has gained total control over one universe. In the pursuit of philosophy, Omega flips a fair coin to determine which of two agents she should create. If the coin comes up heads, Omega will create a paperclip maximizer. If it comes up tails, she creates a perfectly identical agent, but with one difference: the agent is a staple maximizer. After the creation of these agents, Omega hands either of them total control over the universe and lets them know about this procedure. There are gains from trade: producing both paperclips and staples creates 60% utility for both of the agents, while producing only one of those creates 100% for one of the agents. Hence, both agents would (in expectation) benefit from a joint precommitment to a compromise utility function, even if only one of the agents is actually created. What should the created agent do?

If the agents condition on their existence, then they will not gain as much in expectation as they could otherwise expect to gain before the coin flip (when neither of the agents existed). I have chosen this thought experiment because it is not confounded by the involvement of simulated agents, a factor which could lead to anthropic uncertainty and hence make the agents more updateless than they would otherwise be.
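
The expected-value comparison behind this thought experiment can be written out as a short sketch. The 60% and 100% figures and the fair coin come from the description above; treating the other agent's good as worth zero to an agent is an assumption the description implies but does not state, and the policy names are hypothetical labels.

```python
P_HEADS = 0.5            # fair coin, as in the thought experiment
U_COMPROMISE = 0.6       # utility to each agent if both goods are produced
U_OWN_GOAL_ONLY = 1.0    # utility to an agent if only its own good is produced
U_OTHER_GOAL_ONLY = 0.0  # assumption: the other agent's good is worth nothing to it


def prior_expected_utility(policy):
    """Expected utility for the paperclip maximizer's values before the coin flip.

    Both possible agents are assumed to follow the same policy.
    """
    if policy == "compromise":
        # Whichever agent is created produces both goods.
        return P_HEADS * U_COMPROMISE + (1 - P_HEADS) * U_COMPROMISE
    if policy == "selfish":
        # Heads: paperclips only. Tails: staples only.
        return P_HEADS * U_OWN_GOAL_ONLY + (1 - P_HEADS) * U_OTHER_GOAL_ONLY
    raise ValueError(f"unknown policy: {policy}")


for policy in ("compromise", "selfish"):
    print(policy, prior_expected_utility(policy))
```

By symmetry the same numbers hold for the staple maximizer, so from the perspective before the coin flip both agents do better under the joint compromise, even though the created agent, once it conditions on its own existence, would get more by defecting.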

UDT agents with differing priors

What about UDT agents using differing priors? For simplicity, I suppose there are only two agents. I also assume that both agents have equal capacity to create utilons in their universes. (If this is not the case, the weights in the resulting compromise utility function have to be adjusted.) Suppose both agents start out with the same prior, but update it on their own existence, i.e., they both exclude any worlds that don’t contain an instantiation of themselves. This posterior is then used to select policies. Agent B can’t benefit from any cooperative actions by agent A in a world that only exists in agent A’s posterior. Conversely, agent A also can’t benefit from agent B in worlds that agent A doesn’t think could be actual anymore. So the UDT policy will recommend pursuing a compromise function only in worlds lying in the intersection of worlds that exist in both agents’ posteriors. If either agent updates that they are in some of the worlds to which the other agent assigns approximately zero probability, then they won’t cooperate.

More generally, if both agents know which world is actual, and this is a world which they both inhabit, then it doesn’t matter which prior they used to select their policies. (Of course, this world must have nonzero probability in both of their priors; otherwise they wouldn’t ever update that said world is actual.) From the prior perspective, for agent A, every sacrificed utilon in this world is weighted by its prior measure of the world. Every gained utilon from agent B is also weighted by the same prior measure. So there is no friction in this compromise: if both agents decide between an action a which gives themselves d utilons, and an action b which gives the other agent c utilons, then any agent will prefer option b iff c multiplied by this agent’s prior measure of the world is greater than d multiplied by the same prior measure, so iff c is greater than d. Given that there is a way to normalize both agents’ utility functions, pursuing a sum of those utility functions seems optimal.

We can even expand this to the case wherein the two agents have any differing priors with a nonempty intersection between the corresponding sets of possible worlds. In expectation, the policy that says: “if any world outside the intersection is actual: don’t compromise; if any world from the intersection is actual: do the standard UDT compromise, but use the posterior distribution in which all worlds outside the intersection have zero probability for policy selection” seems best. When evaluating this policy, both agents can weight both utilons sacrificed for others, as well as utilons gained from others, in any of the worlds from the intersection by the measure of the entire intersection in their own respective priors. This again creates a symmetrical situation with a 1:1 trade ratio between utilons sacrificed and gained.

Another case to consider is if the agents also distribute the relative weights between the worlds in the intersection differently. I think that this does not lead to asymmetries (in the sense that conditional on some of the worlds being actual, one agent stands to gain and lose more than the other agent). Suppose agent A has 30% on world S1, and 20% on world S2. Agent B, on the other hand, has 10% on world S1 and 20% on world S2. If both agents follow the policy of pursuing the sum of utility functions, given that they find themselves in either of the two shared worlds, then, ceteris paribus, both will in expectation benefit to an equal degree. For instance, let c1 (c2) be the amount of utilons either agent can create for the other agent in world S1 (S2), and d1 (d2) the respective amount agents can create for themselves. Then agent A gets either 0.3×c1+0.2×c2 or 0.3×d1+0.2×d2, while B chooses between 0.1×c1+0.2×c2 and 0.1×d1+0.2×d2. Here, it’s not the case that A prefers cooperating iff B prefers cooperating. But assuming that in expectation, c1 = c2 as well as d1 = d2, this leads to a situation where both prefer cooperation iff c1 > d1. It follows that just pursuing a sum of both agents’ utility functions is, in expectation, optimal for both agents.
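
The comparison in the previous paragraph can be checked with a minimal sketch. The prior weights are the ones given above; the concrete values for c and d are hypothetical and chosen only to show one case where the agents disagree and two where they agree.

```python
# Prior weights on the two shared worlds, as given in the text.
PRIORS = {"A": {"S1": 0.3, "S2": 0.2}, "B": {"S1": 0.1, "S2": 0.2}}


def prefers_cooperation(agent, c, d):
    """True iff the agent's prior-weighted gain from being cooperated with
    exceeds its prior-weighted gain from serving only itself."""
    weights = PRIORS[agent]
    cooperate = sum(weights[w] * c[w] for w in weights)
    defect = sum(weights[w] * d[w] for w in weights)
    return cooperate > defect


# Hypothetical payoffs where c1 != c2: the two agents disagree.
unequal_c = {"S1": 3.0, "S2": 1.0}
same_d = {"S1": 2.0, "S2": 2.0}
print(prefers_cooperation("A", unequal_c, same_d), prefers_cooperation("B", unequal_c, same_d))

# With c1 = c2 and d1 = d2, both agree, and cooperate iff c1 > d1.
equal_c = {"S1": 3.0, "S2": 3.0}
print(prefers_cooperation("A", equal_c, same_d), prefers_cooperation("B", equal_c, same_d))
```

When c and d are the same in both worlds, each agent's total weight on the intersection factors out of both sides of its comparison, so only the sign of c1 - d1 matters.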

Lastly, consider a combination of non-identical priors with empirical uncertainty. For UDT, empirical uncertainty between worlds translates into anthropic uncertainty about which of the possible worlds the agent inhabits. In this case, as expected, there is “friction”. For example, suppose agent A assigns p to the intersection of the worlds in both agents’ priors, while agent B assigns p/q. Before they find out whether one of the worlds from the intersection or some other world is actual, the situation is the following: B can benefit from A’s cooperation in only p/q of the worlds. A can benefit in p of the worlds from B, but for everything A does, this will only mean p/q as much to agent B. Now each agent can again either create d utilons for themselves, or perform a cooperative action that gives c utilons to the other agent in the world where the action is performed. Given uncertainty about which world is actual, if both agents choose cooperation, agent A receives c×p utilons in expectation, while agent B receives c×p/q utilons in expectation. Defection gives both agents d utilons. So for cooperation to be worth it, c×p and c×p/q both have to be greater than d. If this is the case, then if p is unequal to p/q, both agents’ gains from trade are still not equal. This appears to be a bargaining problem that doesn’t solve as easily as the examples from above.
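
The friction described above can be illustrated with a short sketch. It uses the expected payoffs stated in the paragraph (c×p for agent A, c×p/q for agent B, and d for defection); the numerical inputs and the function name are hypothetical.

```python
def net_gains_from_cooperation(p, q, c, d):
    """Expected gain of mutual cooperation over defection for each agent.

    p is A's measure on the shared worlds and p/q is B's, following the text.
    """
    coop_a = c * p
    coop_b = c * p / q
    return {"A": coop_a - d, "B": coop_b - d}


# Hypothetical inputs: cooperation is worthwhile for both, but not equally so.
gains = net_gains_from_cooperation(p=0.8, q=2.0, c=10.0, d=3.0)
print(gains)
both_benefit = all(value > 0 for value in gains.values())
print(both_benefit, gains["A"] != gains["B"])
```

Both agents come out ahead here, yet A's expected gain is larger than B's, which is the unequal split that turns the situation into a bargaining problem.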

Conclusion

I actually endorse the conclusion that humans should cooperate with all correlating agents. Although humans’ decision algorithms might not correlate with as many other agents, and they might not be able to compromise as efficiently as super-human AIs, humans should nevertheless pursue some multiverse-wide sum of values. What I’m uncertain about is how far updatelessness should go. For instance, it is not clear to me which empirical and logical evidence humans should and shouldn’t take into account when selecting policies. If an AI does not start out with the knowledge that humans possess but instead uses the universal prior, then it might perform actions that seem irrational given human knowledge. Even if observations are logically inconsistent with the existence of a fellow cooperation partner (i.e., in the updated distribution, the cooperation partner’s world has zero probability), then UDT might still cooperate with and possibly adopt that partner’s values. I doubt at this point whether everyone still agrees with the hypothesis that UDT always achieves the highest utility.

Acknowledgements

I thank Caspar Oesterheld, Max Daniel, Lukas Gloor, and David Althaus for helpful comments on a draft of this post, and Adrian Rorheim for copy editing.