Moral realism and AI alignment

Abstract: Some have claimed that moral realism – roughly, the claim that moral claims can be true or false – would, if true, have implications for AI alignment research, such that moral realists might approach AI alignment differently than moral anti-realists. In this post, I briefly discuss different versions of moral realism based on what they imply about AI. I then go on to argue that pursuing moral-realism-inspired AI alignment would bypass philosophical disagreements and help resolve non-philosophical disagreements related to moral realism. Hence, even from a non-realist perspective, it is desirable that moral realists (and others who understand the relevant realist perspectives well enough) pursue moral-realism-inspired AI alignment research.

Different forms of moral realism and their implications for AI alignment

Roughly, moral realism is the view that “moral claims do purport to report facts and are true if they get the facts right.” So for instance, most moral realists would hold the statement “one shouldn’t torture babies” to be true. Importantly, this moral claim is different from a claim about baby torturing being instrumentally bad given some other goal (a.k.a. a “hypothetical imperative”) such as “if one doesn’t want to land in jail, one shouldn’t torture babies.” It is uncontroversial that such claims can be true or false. Moral claims, as I understand them in this post, are also different from descriptive claims about some people’s moral views, such as “most Croatians are against babies being tortured” or “I am against babies being tortured and will act accordingly”. More generally, the versions of moral realism discussed here claim that moral truth is in some sense mind-independent. It’s not so obvious what it means for a moral claim to be true or false, so there are many different versions of moral realism. I won’t go into more detail here, though we will revisit differences between different versions of moral realism later. For a general introduction on moral realism and meta-ethics, see, e.g., the SEP article on moral realism.

I should note right here that I myself find at least “strong versions” of moral realism implausible. But in this post, I don’t want to argue about meta-ethics. Instead, I would like to discuss an implication of some versions of moral realism. I will later say more about why I am interested in the implications of a view I believe to be misguided, but for now suffice it to say that “moral realism” is a majority view among professional philosophers (though I don’t know how popular the versions of moral realism studied in this post are), which makes it interesting to explore the view’s possible implications.

The implication that I am interested in here is that moral realism helps with AI alignment in some way. One very strong version of the idea is that the orthogonality thesis is false: if there is a moral truth, agents (e.g., AIs) that are able to reason successfully about a lot of non-moral things will automatically be able to reason correctly about morality as well and will then do what they infer to be morally correct. On p. 176 of “The Most Good You Can Do”, Peter Singer defends such a view: “If there is any validity in the argument presented in chapter 8, that beings with highly developed capacities for reasoning are better able to take an impartial ethical stance, then there is some reason to believe that, even without any special effort on our part, superintelligent beings, whether biological or mechanical, will do the most good they possibly can.” In the articles “My Childhood Death Spiral”, “A Prodigy of Refutation” and “The Sheer Folly of Callow Youth” (among others), Eliezer Yudkowsky says that he used to hold such a view.

Of course, current AI techniques do not seem to automatically include moral reasoning. For instance, if you develop an automated theorem prover to reason about mathematics, it will not be able to derive “moral theorems”. Similarly, if you use the Sarsa algorithm to train some agent with some given reward function, that agent will adapt its behavior in a way that increases its cumulative reward regardless of whether doing so conflicts with some ethical imperative. The moral realist would thus have to argue that in order to get to AGI or superintelligence or some other milestone, we will necessarily have to develop new and very different reasoning algorithms and that these algorithms will necessarily incorporate ethical reasoning. Peter Singer doesn’t state this explicitly. However, he makes a similar argument about human evolution on p. 86f. in ch. 8:
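To make the point about Sarsa concrete, here is a minimal sketch of tabular Sarsa in a toy, single-state environment. The environment, the action names ("harmless", "harmful") and the reward numbers are all made up for illustration. The update rule only ever sees the reward signal, so the agent ends up preferring whichever action pays more, whatever label we attach to it.

```python
import random

# Hypothetical single-state environment: the labels carry the "ethics",
# but the learner only ever sees the numbers.
REWARD = {"harmless": 1.0, "harmful": 2.0}
ACTIONS = list(REWARD)


def epsilon_greedy(q, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q[a])


def run_sarsa(steps=5000, alpha=0.1, gamma=0.5, epsilon=0.1, seed=0):
    """Tabular Sarsa with a single state, so Q is indexed by action only."""
    rng = random.Random(seed)
    q = {a: 0.0 for a in ACTIONS}
    action = epsilon_greedy(q, epsilon, rng)
    for _ in range(steps):
        reward = REWARD[action]
        next_action = epsilon_greedy(q, epsilon, rng)
        # Sarsa update: bootstrap from the action that is actually taken next.
        q[action] += alpha * (reward + gamma * q[next_action] - q[action])
        action = next_action
    return q


q = run_sarsa()
greedy_action = max(ACTIONS, key=lambda a: q[a])
print(q, greedy_action)
```

Nothing in the update rule refers to anything but reward, which is the sense in which the technique is value-neutral.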

The possibility that our capacity to reason can play a critical role in a decision to live ethically offers a solution to the perplexing problem that effective altruism would otherwise pose for evolutionary theory. There is no difficulty in explaining why evolution would select for a capacity to reason: that capacity enables us to solve a variety of problems, for example, to find food or suitable partners for reproduction or other forms of cooperative activity, to avoid predators, and to outwit our enemies. If our capacity to reason also enables us to see that the good of others is, from a more universal perspective, as important as our own good, then we have an explanation for why effective altruists act in accordance with such principles. Like our ability to do higher mathematics, this use of reason to recognize fundamental moral truths would be a by-product of another trait or ability that was selected for because it enhanced our reproductive fitness—something that in evolutionary theory is known as a spandrel.

A slightly weaker variant of this strong convergence moral realism is the following: Not all superintelligent beings would be able to identify or follow moral truths. However, if we add some feature that is not directly normative, then superintelligent beings would automatically identify the moral truth. For example, David Pearce appears to claim that “the pain-pleasure axis discloses the world’s inbuilt metric of (dis)value” and that therefore any superintelligent being that can feel pain and pleasure will automatically become a utilitarian. At the same time, that moral realist could believe that a non-conscious AI would not necessarily become a utilitarian. So, this slightly weaker variant of strong convergence moral realism would be consistent with the orthogonality thesis.

I find all of these strong convergence moral realisms very implausible. Especially given how current techniques in AI work – how value-neutral they are – the claim that algorithms for AGI will all automatically incorporate the same moral sense seems extraordinary and I have seen little evidence for it[1] (though I should note that I have read only bits and pieces of the moral realism literature).[2]

It even seems easy to come up with semi-rigorous arguments against strong convergence moral realism. Roughly, it seems that we can use a moral AI to build an immoral AI. Here is a simple example of such an argument. Imagine we had an AI system that (given its computational constraints) always chooses the most moral action. Now, it seems that we could construct an immoral AI system using the following algorithm: Use the moral AI to decide which action of the immoral AI system it would prevent from being taken if it could only choose one action to be prevented. Then take that action. There is a gap in this argument: perhaps the moral AI simply refuses to choose the moral actions in “prevention” decision problems, reasoning that it might currently be used to power an immoral AI. (If exploiting a moral AI was the only way to build other AIs, then this might be the rational thing to do as there might be more exploitation attempts than real prevention scenarios.) Still (without having thought about it too much), it seems likely to me that a more elaborate version of such an argument could succeed.
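The construction can be written down in a few lines. This is only a sketch: the function `most_important_to_prevent` stands in for a query to the hypothetical moral AI, and the harm scores are a toy placeholder for whatever the moral AI actually computes. It also ignores the gap mentioned above, i.e., it assumes the moral AI answers "prevention" questions truthfully.

```python
def make_immoral_chooser(most_important_to_prevent):
    """Wrap a moral AI's "which single action would you prevent?" answer
    into a chooser that takes exactly that action."""
    def choose(actions):
        return most_important_to_prevent(actions)
    return choose


# Toy stand-in for the moral AI: made-up harm scores, higher is worse.
HARM = {"help": 0, "ignore": 1, "steal": 5, "sabotage": 9}


def toy_moral_prevent(actions):
    """The action the toy moral AI would most want to prevent."""
    return max(actions, key=lambda a: HARM[a])


immoral = make_immoral_chooser(toy_moral_prevent)
chosen = immoral(["help", "ignore", "steal", "sabotage"])
print(chosen)
```

The wrapper adds no moral knowledge of its own. All of the work is done by the moral AI's answer, which is what makes the argument troubling for strong convergence.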

Here’s a weaker moral realist convergence claim about AI alignment: There’s moral truth and we can program AIs to care about the moral truth. Perhaps it suffices to merely “tell them” to refer to the moral truth when deciding what to do. Or perhaps we would have to equip them with a dedicated “sense” for identifying moral truths. This version of moral realism again does not claim that the orthogonality thesis is wrong, i.e. that sufficiently effective AI systems will automatically behave ethically without us giving them any kind of moral guidance. It merely states that in addition to the straightforward approach of programming an AI to adopt some value system (such as utilitarianism), we could also program the AI to hold the correct moral system. Since pointing at something that exists in the world is often easier than describing that thing, it might be thought that this alternative approach to value loading is easier than the more direct one.

I haven’t found anyone who defends this view (I haven’t looked much), but non-realist Brian Tomasik gives this version of moral realism as a reason to discuss moral realism:

Moral realism is a fun philosophical topic that inevitably generates heated debates. But does it matter for practical purposes? […] One case where moral realism seems problematic is regarding superintelligence. Sometimes it’s argued that advanced artificial intelligence, in light of its superior cognitive faculties, will have a better understanding of moral truth than we do. As a result, if it’s programmed to care about moral truth, the future will go well. If one rejects the idea of moral truth, this quixotic assumption is nonsense and could lead to dangerous outcomes if taken for granted.

(Below, I will argue that there might be no reason to be afraid of moral realists. However, my argument will, like Brian’s, also imply that moral realism is worth debating in the context of AI.)

As an example, consider a moral realist view according to which moral truth is similar to mathematical truth: there are some axioms of morality which are true (for reasons I, as a non-realist, do not understand or agree with) and together these axioms imply some moral theory X. This moral realist view suggests an approach to AI alignment: program the AI to abide by these axioms (in the same way as we can have automated theorem provers assume some set of mathematical axioms to be true). It seems clear that something along these lines could work. However, this approach’s reliance on moral realism is also much weaker.

As a second example, divine command theory states that moral truth is determined by God’s will (again, I don’t see why this should be true and how it could possibly be justified). A divine command theorist might therefore want to program the AI to do whatever God wants it to do.

Here are some more such theories:

  • Social contract
  • Habermas’ discourse ethics
  • Universalizability / Kant’s categorical imperative
  • Applying human intuition

Besides pointing being easier than describing, another potential advantage of such a moral realist approach might be that one is more confident in one’s meta-ethical view (“the pointer”) than in one’s object-level moral system (“one’s own description”). For example, someone could be confident that moral truth is determined by God’s will but be unsure that God’s will is expressed via the Bible, the Quran or something else, or how these religious texts are to be understood. Then that person would probably favor AI that cares about God’s will over AI that follows some particular interpretation of, say, the moral rules proposed in the Quran and Sharia.

A somewhat related issue which has received more attention in the moral realism literature is the convergence of human moral views. People have given moral realism as an explanation for why there is near-universal agreement on some ethical views (such as “when religion and tradition do not require otherwise, one shouldn’t torture babies”). Similarly, moral realism has been associated with moral progress in human societies, see, e.g., Huemer (2016). At the same time, people have used the existence of persisting and unresolvable moral disagreements (see, e.g., Bennigson 1996 and Sayre-McCord 2017, sect. 1) and the existence of gravely immoral behavior in some intelligent people (see, e.g., Nichols 2002) as arguments against moral realism. Of course, all of these arguments take moral realism to include a convergence thesis where being a human (and perhaps not being affected by some mental disorders) or being a society of humans is sufficient to grasp and abide by moral truth.

Of course, there are also versions of moral realism that have even weaker (or just very different) implications for AI alignment and do not make any relevant convergence claims (cf. McGrath 2010). For instance, there may be moral realists who believe that there is a moral truth but that machines are in principle incapable of finding out what it is. Some may also call very different views “moral realism”, e.g. claims that given some moral imperative, it can be decided whether an action does or does not comply with that imperative. (We might call this “hypothetical imperative realism”.) Or “linguistic” versions of moral realism which merely make claims about the meaning of moral statements as intended by whoever utters these moral statements. (Cf. Lukas Gloor’s post on how different versions of moral realism differ drastically in terms of how consequential they are.) Or a kind of “subjectivist realism”, which drops mind-independence (cf. Olson 2014, ch. 2).

Why moral-realism-inspired research on AI alignment might be useful

I can think of many reasons why moral realism-based approaches to AI safety have not been pursued much: AI researchers often do not have a sufficiently high awareness of or interest in philosophical ideas; the AI safety researchers who do – such as researchers at MIRI – tend to reject moral realism, at least the versions with implications for AI alignment; although “moral realism” is popular among philosophers, versions of moral realism with strong implications for AI (à la Peter Singer or David Pearce) might be unpopular even among philosophers (cf. again Lukas’ post on how different versions of moral realism differ drastically in terms of how consequential they are); and so on…

But why am I now proposing to conduct such research, given that I am not a moral realist myself? The main reason (besides some weaker reasons like pluralism and keeping this blog interesting) is that I believe AI alignment research from a moral realist perspective might actually increase agreement between moral realists and anti-realists about how (and to which extent) AI alignment research should be done. In the following, I will briefly argue this case for the strong (à la Peter Singer and David Pearce) and the weak convergence versions of moral realism outlined above.

Strong versions

Like most problems in philosophy, the question of whether moral realism is true lacks an accepted truth condition or an accepted way of verifying an answer or an argument for either realism or anti-realism. This is what makes these problems so puzzling and intractable. This is in contrast to problems in mathematics where it is pretty clear what counts as a proof of a hypothesis. (This is, of course, not to say that mathematics involves no creativity or that there are no general purpose “tools” for philosophy.) However, the claim made by strong convergence moral realism is more like a mathematical claim. Although it is yet to be made precise, we can easily imagine a mathematical (or computer-scientific) hypothesis stating something like this: “For any goal X of some kind [namely the objectively incorrect and non-trivial-to-achieve kind] there is no efficient algorithm that when implemented in a robot achieves X in some class of environments. So, for instance, it is in principle impossible to build a robot that turns Earth into a pile of paperclips.” It may still be hard to formalize such a claim and mathematical claims can still be hard to prove or disprove. But determining the truth of a mathematical statement is not a philosophical problem, anymore. If someone lays out a mathematical proof or disproof of such a claim, any reasonable person’s opinion would be swayed. Hence, I believe that work on proving or disproving this strong version of moral realism will lead to (more) agreement on whether the “strong-moral-realism-based theory of AI alignment” is true.

It is worth noting that finding out whether strong convergence is true may not resolve metaphysical issues. Of course, all strong versions of moral realism would turn out false if the strong convergence hypothesis were falsified. But other versions of moral realism would survive. Conversely, if the strong convergence hypothesis turned out to be true, then anti-realists may remain anti-realists (cf. footnote 2). But if our goal is to make AI moral, the convergence question is much more important than the metaphysical question. (That said, for some people the metaphysical question has a bearing on whether they have preferences over AI systems’ motivation system – “if no moral view is more true than any other, why should I care about what AI systems do?”)

Weak versions

Weak convergence versions of moral realism do not make such in-principle-testable predictions. Their only claim is the metaphysical view that the goals identified by some method X (such as derivation from a set of moral axioms, finding out what God wants, discourse, etc.) have some relation to moral truths. Thinking about weak convergence moral realism from the more technical AI alignment perspective is therefore unlikely to resolve disagreements about whether some versions of weak convergence moral realism are true. However, I believe that by not making testable predictions, weak convergence versions of moral realism are also unlikely to lead to disagreement about how to achieve AI alignment.

Imagine moral realists were to propose that AI systems should reason about morality according to some method X on the basis that the result of applying X is the moral truth. Then moral anti-realists could agree with the proposal on the basis that they (mostly) agree with the results of applying method X. Indeed, for any moral theory with realist ambitions, ridding that theory of these ambitions yields a new theory which an anti-realist could defend. As an example, consider Habermas’ discourse ethics and Yudkowsky’s Coherent Extrapolated Volition. The two approaches to justifying moral views seem quite similar – roughly: do what everyone would agree with if they were exposed to more arguments. But Habermas’ theory explicitly claims to be realist while Yudkowsky is a moral anti-realist, as far as I can tell.

In principle, it could be that moral realists defend some moral view on the grounds that it is true even if it seems implausible to others. But here’s a general argument for why this is unlikely to happen. You cannot directly perceive ought statements (David Pearce and others would probably disagree) and it is easy to show that you cannot derive a statement containing an ought without using other statements containing an ought or inference rules that can be used to introduce statements containing an ought. Thus, if moral realism (as I understand it for the purpose of this paper) is true, there must be some moral axioms or inference rules that are true without needing further justification, similar to how some people view the axioms of Peano arithmetic or Euclidean geometry. An example of such a moral rule could be (a formal version of) “pain is bad”. But if these rules are “true without needing further justification”, then they are probably appealing to anti-realists as well. Of course, anti-realists wouldn’t see them as deserving the label of “truth” (or “falsehood”), but assuming that realists and anti-realists have similar moral intuitions, anything that a realist would call “true without needing further justification” should also be appealing to a moral anti-realist.

As I have argued elsewhere, it’s unlikely we will ever come up with (formal) axioms (or methods, etc.) for morality that would be widely accepted by the people of today (or even among today’s Westerners with secular ethics). But I still think it’s worth a try. If it doesn’t work out, weak convergence moral realists might come around to other approaches to AI alignment, e.g. ones based on extrapolating from human intuition.

Other realist positions

Besides realism about morality, there are many other less commonly discussed realist positions, for instance, realism about which prior probability distribution to use, whether to choose according to some expected value maximization principle (and if so which one), etc. The above considerations apply to these other realist positions as well.


1. There are some “universal instrumental goal” approaches to justifying morality. Some are based on cooperation and work roughly like this: “Whatever your intrinsic goals are, it is often better to be nice to others so that they reciprocate. That’s what morality is.” I think such theories fail for two reasons: First, there seem to be many widely accepted moral imperatives that cannot be fully justified by cooperation. For example, we usually consider it wrong for dictators to secretly torture and kill people, even if doing so has no negative consequences for them. Second, being nice to others because one hopes that they reciprocate is not, I think, what morality is about. To the contrary, I think morality is about caring about things (such as other people’s welfare) intrinsically. I discuss this issue in detail with a focus on so-called “superrational cooperation” in chapter 6.7 of “Multiverse-wide Cooperation via Correlated Decision Making”. Another “universal instrumental goal” approach is the following: If there is at least one god, then not making these gods angry at you may be another universal instrumental goal, so whatever an agent’s intrinsic goal is, it will also act according to what the gods want. The same “this is not what morality is about” argument seems to apply.

2. Yudkowsky has written about why he now rejects this form of moral realism in the first couple of blog posts in the “Value Theory” series.

Goertzel’s GOLEM implements evidential decision theory applied to policy choice

I’ve written about the question of which decision theories describe the behavior of approaches to AI like the “Law of Effect”. In this post, I would like to discuss GOLEM, an architecture for a self-modifying artificial intelligence agent described by Ben Goertzel (2010; 2012). Goertzel calls it a “meta-architecture” because all of the intelligent work of the system is done by sub-programs that the architecture assumes as given, such as a program synthesis module (cf. Kaiser 2007).

Roughly, the top-level self-modification is done as follows. For any proposal for a (partial) self-modification, i.e. a new program to replace (part of) the current one, the “Predictor” module predicts how well that program would achieve the goal of the system. Another part of the system — the “Searcher” — then tries to find programs that the Predictor deems superior to the current program. So, at the top level, GOLEM chooses programs according to some form of expected value calculated by the Predictor. The first interesting decision-theoretical statement about GOLEM is therefore that it chooses policies — or, more precisely, programs — rather than individual actions. Thus, it would probably give the money in at least some versions of counterfactual mugging. This is not too surprising, because it is unclear on what basis one should choose individual actions when the effectiveness of an action depends on the agent’s decisions in other situations.
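The top-level loop can be sketched as follows. This is my own simplification and not Goertzel's specification: the function names, the two candidate policies and the payoffs (pay 100 when asked, receive 10,000 if you are the kind of agent that pays) are assumptions chosen to illustrate policy-level choice in a counterfactual mugging with a fair coin.

```python
def self_modify_step(current_program, searcher, predictor):
    """One round of top-level self-modification: adopt the candidate that
    the Predictor scores highest, if it beats the current program."""
    best, best_score = current_program, predictor(current_program)
    for candidate in searcher(current_program):
        score = predictor(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best


# Toy counterfactual mugging; the payoff numbers are made up.
COST_IF_ASKED = 100
PRIZE_IF_WOULD_PAY = 10_000


def predictor(policy):
    """Expected value of a whole policy, averaged over the fair coin."""
    pays = policy == "pay"
    value_when_asked = -COST_IF_ASKED if pays else 0
    value_when_rewarded = PRIZE_IF_WOULD_PAY if pays else 0
    return 0.5 * value_when_asked + 0.5 * value_when_rewarded


def searcher(current_program):
    """Propose every known policy other than the current one."""
    return [p for p in ("pay", "refuse") if p != current_program]


new_policy = self_modify_step("refuse", searcher, predictor)
print(new_policy, predictor(new_policy))
```

Because the score is attached to the policy as a whole, "pay" wins even though paying looks like a pure loss once the agent has actually been asked.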

The next natural question to ask is, of course, what expected value (causal, evidential or other) the Predictor computes. Like the other aspects of GOLEM, the Predictor is subject to modification. Hence, we need to ask according to what criteria it is updated. The criterion is provided by the Tester, a “hard-wired program that estimates the quality of a candidate Predictor” based on “how well a Predictor would have performed in the past” (Goertzel 2010, p. 4). I take this to mean that the Predictor is judged based on the extent to which it is able to predict the things that actually happened in the past. For instance, imagine that at some time in the past the GOLEM agent self-modified to a program that one-boxes in Newcomb’s problem. Later, the agent actually faced a Newcomb problem based on a prediction that was made before the agent self-modified into a one-boxer and won a million dollars. Then the Predictor should be able to predict that self-modifying to one-boxing in this case “yielded” getting a million dollars even though it did not do so causally. More generally, to maximize the score from the Tester, the Predictor has to compute regular (evidential) conditional probabilities and expected utilities. Hence, it seems that the EV computed by the Predictor is a regular EDT-ish one. This is not too surprising, either, because as we have seen before, it is much more common for learning algorithms to implement EDT, especially if they implement something which looks like the Law of Effect.
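As a rough illustration of what a Predictor that scores well on past data would have to compute, here is a sketch that estimates the expected payoff of a program as the average payoff observed in past episodes in which that program was run. The history below is invented, and the estimator is my assumption about what "performing well in the past" rewards, not something taken from Goertzel's papers.

```python
from collections import defaultdict


def evidential_expected_utility(history):
    """Estimate E[payoff | program] as the mean payoff over past episodes
    in which that program was the one being run."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for program, payoff in history:
        totals[program] += payoff
        counts[program] += 1
    return {program: totals[program] / counts[program] for program in totals}


# Invented Newcomb-like history; the one episode with payoff 0 models an
# occasional misprediction by the predictor in Newcomb's problem.
history = [
    ("one-box", 1_000_000),
    ("one-box", 1_000_000),
    ("two-box", 1_000),
    ("one-box", 0),
    ("two-box", 1_000),
]

estimates = evidential_expected_utility(history)
best_program = max(estimates, key=estimates.get)
print(estimates, best_program)
```

The estimate conditions on which program was run and does not ask whether running it caused the payoff, which is why the resulting expected value looks evidential rather than causal.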

In conclusion, GOLEM learns to choose policy programs based on their EDT-expected value.

Acknowledgements

This post is based on a discussion with Linda Linsefors, Joar Skalse, and James Bell.

Market efficiency and charity cost-effectiveness

In an efficient market, one can expect that most goods are sold at a price-quality ratio that is hard to improve upon. If there was some easy way to produce a product cheaper or to produce a higher-quality version of it for a similar price, someone else would probably have seized that opportunity already – after all, there are many people who are interested in making money. Competing with and outperforming existing companies thus requires luck, genius or expertise. Also, if you trust other buyers to be reasonable, you can more or less blindly buy any “best-selling” product.

Several people, including effective altruists, have remarked that this is not true in the case of charities. Since most donors don’t systematically choose the most cost-effective charities, most donations go to charities that are much less cost-effective than the best ones. Thus, if you sit on a pile of resources – your career, say – outperforming the average charity at doing good is fairly easy.

The fact that charities don’t compete for cost-effectiveness doesn’t mean there’s no competition at all. Just like businesses in the private sector compete for customers, charities compete for donors. It just happens to be the case that being good at convincing people to donate doesn’t correlate strongly with cost-effectiveness.

Note that in the private sector, too, there can be a misalignment between persuading customers and producing the kind of product you are interested in, or even the kind of product that customers in general will enjoy or benefit from using. Any example will be at least somewhat controversial, as it will suggest that buyers make suboptimal choices. Nevertheless, I think addictive drugs like cigarettes are an example that many people can agree with. Cigarettes seem to provide almost no benefits to consumers, at least relative to taking nicotine directly. Nevertheless, people buy them, perhaps because smoking is associated with being cool or because they are addictive.

One difference between competition in the for-profit and nonprofit sectors is that the latter lacks monetary incentives. It’s nearly impossible to become rich by founding or working at a charity. Thus, people primarily interested in money won’t start a charity, even if they have developed a method of persuading people of some idea that is much more effective than existing methods. However, making a charity succeed is still rewarded with status and (the belief in) having had an impact. So in terms of persuading people to donate, the charity “market” is probably somewhat efficient in areas that confer status and that potential founders and employees intrinsically care about.

If you care about investing your resource pile most efficiently, this efficiency at persuading donors offers little consolation. On the contrary, it even predicts that if you use your resources to found or support an especially cost-effective charity, fundraising will be difficult. Perhaps you previously thought that, since your charity is “better”, it will also receive more donations than existing ineffective charities. But now it seems that if cost-effectiveness really helped with fundraising, more charities would have already become more cost-effective.

There are, however, cause areas in which the argument about effectiveness at persuasion carries a different tone. In these cause areas, being good at fundraising strongly correlates with being good at what the charity is supposed to do. An obvious example is that of charities whose goal it is to fundraise for other charities, such as Raising for Effective Giving. (Disclosure: I work for REG’s sister organization FRI and am a board member of REG’s parent organization EAF.) If an organization is good at fundraising for itself, it’s probably also good at fundraising for others. So if there are already lots of organizations whose goal it is to fundraise for other organizations, one might expect that these organizations already do this job so well that they are hard to outperform in terms of money moved per resources spent. (Again, some of these may be better because they fundraise for charities that generate more value according to your moral view.)

Advocacy is another cause area in which successfully persuading donors correlates with doing a very good job overall. If an organization can persuade people to donate and volunteer to promote veganism, it seems plausible that they are also good at promoting veganism. Perhaps most of the organization’s budget even comes from people they persuaded to become vegan, in which case their ability to find donors and volunteers is a fairly direct measure of their ability to persuade people to adopt a vegan diet. (Note that I am, of course, not saying that competition ensures that organizations persuade people of the most useful ideas.) As with fundraising organizations, this suggests that it’s hard to outperform advocacy groups in areas where lots of people have incentives to advocate, because if there were some simple method of persuading people, it’s very likely that some large organization based on that method would have already been established.

That said, there are many caveats to this argument for a strong correlation between fundraising and advocacy effectiveness. First off, for many organizations, fundraising appears to be primarily about finding, retaining and escalating a small number of wealthy donors. For some organizations, a similar statement might be true about finding volunteers and employees. In contrast, the goal of most advocacy organizations is to persuade a large number of people.1 So there may be organizations whose members are very persuasive in person and thus capable of bringing in many large donors, but who don’t have any idea about how to run a large-scale campaign oriented toward “the masses”. When trying to identify cost-effective advocacy charities, this problem can, perhaps, be addressed by giving some weight to the number of donations that a charity brings in, as opposed to donation sizes alone.2 However, the more important point is that if growing big is about big donors, then a given charity’s incentives and selection pressures for survival and growth are misaligned with persuading many people. Thus, it becomes more plausible again that the average big or fast-growing advocacy-based charity is a suboptimal use of your resource pile.

Second, I stipulated that a good way of getting new donors and volunteers is to simply persuade as many people of your general message as possible, and then hope that some of these will also volunteer at or donate to your organization. But even if all donors contribute similar amounts, some target audiences are more likely to donate than others.3 In particular, people seem more likely to contribute larger amounts if they have been involved for longer, have already donated or volunteered, and/or hold a stronger or more radical version of your organization’s views. But persuading these community members to donate works in very different ways than persuading new people. For example, being visible to the community becomes more important. Also, if donating is about identity and self-expression, it becomes more important to advocate in ways that express the community’s shared identity rather than in ways that are persuasive but compromising. The target audiences for fundraising and advocacy may also vary a lot along other dimensions: for example, to win an election, a political party has to persuade undecided voters, who tend to be uninformed and not particularly interested in politics (see p. 312 of Achen and Bartels’s Democracy for Realists); but to collect donations, one has to mobilize long-term party members who probably read lots of news, etc.

Third, the fastest-growing advocacy organizations may have large negative externalities.4 Absent regulations and special taxes, the production of the cheapest products will often damage some public good, e.g., through carbon emissions or the corruption of public institutions. Similarly, advocacy charities may damage some public good. The fastest way to find new members may involve being overly controversial, dumbing down the message or being associated with existing powerful interests, which may damage the reputation of a movement. For example, the neoliberals often suffer from being associated with special/business interests and crony capitalism (see sections “Creating a natural constituency” and “Cooption” in Kerry Vaughan’s What the EA community can learn from the rise of the neoliberals), perhaps because associating with business interests often carries short-term benefits for an individual actor. Again, this suggests that the fastest-growing advocacy charity may be much worse overall than the optimal one.

Acknowledgements

I thank Jonas Vollmer, Persis Eskander and Johannes Treutlein for comments.


1. Lobbying organizations, which try to persuade individual legislators, provide a useful contrast. Especially in countries with common law, organizations may also attempt to win individual legal cases.

2. One thing to keep in mind is that investing effort into persuading big donors is probably a good strategy for many organizations. Thus, a small-donor charity that grows less quickly than a big-donor charity may be more or less cost-effective than the big-donor charity.

3. One of the reasons why one might think that drawing in new people is most effective is that people who are already in the community and willing to donate to an advocacy org probably just fund the charity that persuaded them in the first place. Of course, many people may simply not follow the sentiment of donating to the charity that persuaded them. However, many community members may have been persuaded in ways that don’t present such a default option. For example, many people were persuaded to go vegan by reading Animal Liberation. Since the book’s author, Peter Singer, has no room for more funding, these people have to find other animal advocacy organizations to donate to.

4. Thanks to Persis Eskander for bringing up this point in response to an early version of this post.

The law of effect, randomization and Newcomb’s problem

The law of effect (LoE), as introduced on p. 244 of Thorndike’s (1911) Animal Intelligence, states:

Of several responses made to the same situation, those which are accompanied or closely followed by satisfaction to the animal will, other things being equal, be more firmly connected with the situation, so that, when it recurs, they will be more likely to recur; those which are accompanied or closely followed by discomfort to the animal will, other things being equal, have their connections with that situation weakened, so that, when it recurs, they will be less likely to occur. The greater the satisfaction or discomfort, the greater the strengthening or weakening of the bond.

As I (and others) have pointed out elsewhere, an agent applying LoE would come to “one-box” (i.e., behave like evidential decision theory (EDT)) in Newcomb-like problems in which the payoff is eventually observed. For example, if you face Newcomb’s problem itself multiple times, then one-boxing will be associated with winning a million dollars and two-boxing with winning only a thousand dollars. (As noted in the linked note, this assumes that the different instances of Newcomb’s problem are independent. For instance, one-boxing in the first does not influence the prediction in the second. It is also assumed that CDT cannot precommit to one-boxing, e.g. because precommitment is impossible in general or because the predictions have been made long ago and thus cannot be causally influenced anymore.)

A caveat to this result is that with randomization one can derive more causal decision theory-like behavior from alternative versions of LoE. Imagine an agent that chooses probability distributions over actions, such as the distribution P with P(one-box)=0.8 and P(two-box)=0.2. The agent’s physical action is then sampled from that probability distribution. Furthermore, assume that the predictor in Newcomb’s problem can only predict the probability distribution and not the sampled action and that he fills box B with the probability the agent chooses for one-boxing. If this agent plays many instances of Newcomb’s problem, then she will ceteris paribus fare better in rounds in which she two-boxes. By LoE, she may therefore update toward two-boxing being the better option and consequently two-box with higher probability. Throughout the rest of this post, I will expound on the “goofiness” of this application of LoE.

Notice that this is not the only possible way to apply LoE. Indeed, the more natural way seems to be to apply LoE only to whatever entity the agent has the power to choose rather than something that is influenced by that choice. In this case, this is the probability distribution and not the action resulting from that probability distribution. Applied at the level of the probability distribution, LoE again leads to EDT. For example, in Newcomb’s problem the agent receives more money in rounds in which it chooses a higher probability of one-boxing. Let’s call this version of LoE “standard LoE”. We will call other versions, in which choice is updated to bring some other variable (in this case the physical action) to assume values that are associated with high payoffs, “non-standard LoE”.
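The two ways of applying LoE can be contrasted with a small calculation. The following is only a sketch under the assumptions stated in the text: the predictor fills box B with probability equal to the agent's chosen probability of one-boxing, box B holds a million and the other box a thousand. The function names are made up for illustration.

```python
MILLION = 1_000_000
THOUSAND = 1_000

def payoff_given_action(p_one_box, action):
    """Expected payoff of a round, given the chosen distribution and the sampled action."""
    # The predictor sees only the distribution, so box B is full with probability p_one_box
    # regardless of which action ends up being sampled.
    expected_box_b = p_one_box * MILLION
    bonus = THOUSAND if action == "two-box" else 0
    return expected_box_b + bonus

def payoff_of_distribution(p_one_box):
    """Expected payoff of choosing the distribution itself."""
    return (p_one_box * payoff_given_action(p_one_box, "one-box")
            + (1 - p_one_box) * payoff_given_action(p_one_box, "two-box"))

for p in (0.2, 0.8):
    gap = payoff_given_action(p, "two-box") - payoff_given_action(p, "one-box")
    print(f"p={p}: two-boxing beats one-boxing by {gap:.0f}; "
          f"the distribution is worth {payoff_of_distribution(p):.0f}")
```

Holding the distribution fixed, the sampled action "two-box" always comes out a thousand ahead, which is the signal that LoE picks up when applied to actions. Comparing distributions instead, a higher probability of one-boxing is worth more, which is the signal that LoE picks up when applied to the agent's actual choice.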

Although non-standard LoE yields CDT-ish behavior in Newcomb’s problem, it can easily be criticized on causalist grounds. Consider a non-Newcomblike variant of Newcomb’s problem in which there is no predictor but merely an entity that reads the agent’s mind and fills box B with a million dollars in causal dependence on the probability distribution chosen by the agent. The causal graph representing this decision problem is given below with the subject of choice being marked red. Unless they are equipped with an incomplete model of the world – one that doesn’t include the probability distribution step –, CDT and EDT agree that one should choose the probability distribution over actions that one-boxes with probability 1 in this variant of Newcomb’s problem. After all, choosing that probability distribution causes the game master to see that you will probably one-box and thus also causes him to put money under box B. But if you play this alternative version of Newcomb’s problem and use LoE on the level of one- versus two-boxing, then you would converge on two-boxing because, again, you will fare better in rounds in which you happen to two-box.

[Figure (RandomizationBlogPost.jpg): causal graph of the non-Newcomblike variant, with the subject of choice marked red.]

Be it in Newcomb’s original problem or in this variant of Newcomb’s problem, non-standard LoE can lead to learning processes that don’t seem to match LoE’s “spirit”. When you apply standard LoE (and probably also in most cases of applying non-standard LoE), you develop a tendency to exhibit rewarded choices, and this will lead to more reward in the future. But if you adjust your choices with some intermediate variable in mind, you may get worse and worse. For instance, in either the regular or non-Newcomblike Newcomb’s problem, non-standard LoE adjusts the choice (the probability distribution over actions) so that the (physically implemented) action is more likely to be the one associated with higher reward (two-boxing), but the choice itself (high probability of two-boxing) will be one that is associated with low rewards. Thus, learning according to non-standard LoE can lead to decreasing rewards (in both Newcomblike and non-Newcomblike problems).
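The claim that rewards can decrease under non-standard LoE can be illustrated with a toy simulation. This is a rough sketch, not a canonical formalization of LoE: the batch-based update rules, the step size and the clamping of probabilities are all assumptions made for illustration, and the payoffs are rescaled (box B holds 10, the other box holds 1) purely so that the effect shows up in a short run despite sampling noise.

```python
import random

BOX_B, BOX_A = 10, 1  # rescaled stand-ins for the million and the thousand

def play_round(p_one_box, rng):
    """One round: box B is filled based on the distribution, not the sampled action."""
    box_b = BOX_B if rng.random() < p_one_box else 0
    one_boxes = rng.random() < p_one_box
    return one_boxes, box_b + (0 if one_boxes else BOX_A)

def play_batch(p_one_box, rng, n):
    return [play_round(p_one_box, rng) for _ in range(n)]

def mean(xs):
    return sum(xs) / len(xs)

def clamp(p, lo=0.05, hi=0.95):
    # Keep both actions possible so that both keep being sampled.
    return max(lo, min(hi, p))

def nonstandard_loe(p=0.5, batches=30, n=2000, step=0.05, seed=0):
    """Shift probability toward whichever sampled ACTION earned more in the batch."""
    rng = random.Random(seed)
    history = []
    for _ in range(batches):
        batch = play_batch(p, rng, n)
        history.append((p, mean([pay for _, pay in batch])))
        one = [pay for one_boxes, pay in batch if one_boxes]
        two = [pay for one_boxes, pay in batch if not one_boxes]
        if one and two:
            p = clamp(p + step if mean(one) > mean(two) else p - step)
    return history

def standard_loe(p=0.5, batches=30, n=2000, step=0.05, seed=0):
    """Move to whichever nearby DISTRIBUTION earned more on average."""
    rng = random.Random(seed)
    history = []
    for _ in range(batches):
        candidates = [clamp(p - step), p, clamp(p + step)]
        rewards = [mean([pay for _, pay in play_batch(q, rng, n)]) for q in candidates]
        history.append((p, rewards[1]))
        p = candidates[rewards.index(max(rewards))]
    return history

for name, learner in (("non-standard", nonstandard_loe), ("standard", standard_loe)):
    history = learner()
    print(name, "first (p, mean reward):", history[0], "last:", history[-1])
```

In this sketch the non-standard learner drifts toward two-boxing and its average reward falls, while the standard learner drifts toward one-boxing and its average reward rises.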

All in all, what I call non-standard LoE looks a bit like a hack rather than some systematic, sound version of CDT learning.

As a side note, the sensitivity to the details of how LoE is set up relative to randomization shows that the decision theory (CDT versus EDT versus something else) implied by some agent design can sometimes be very fragile. I originally thought that there would generally be some correspondence between agent designs and decision theories, such that changing the decision theory implemented by an agent usually requires large-scale changes to the agent’s architecture. But switching from standard LoE to non-standard LoE is an example where what seems like a relatively small change can significantly change the resulting behavior in Newcomb-like problems. Randomization in decision markets is another such example. (And the Gödel machine is yet another example, albeit one that seems less relevant in practice.)

Acknowledgements

I thank Lukas Gloor, Tobias Baumann and Max Daniel for advance comments.

Pearl on causality

Here’s a quote by Judea Pearl (from p. 419f. of the Epilogue of the second edition of Causality) that, in light of his other writing on the topic, I found surprising when I first read it:

Let us examine how the surgery interpretation resolves Russell’s enigma concerning the clash between the directionality of causal relations and the symmetry of physical equations. The equations of physics are indeed symmetrical, but when we compare the phrases “A causes B” versus “B causes A,” we are not talking about a single set of equations. Rather, we are comparing two world models, represented by two different sets of equations: one in which the equation for A is surgically removed; the other where the equation for B is removed. Russell would probably stop us at this point and ask: “How can you talk about two world models when in fact there is only one world model, given by all the equations of physics put together?” The answer is: yes. If you wish to include the entire universe in the model, causality disappears because interventions disappear – the manipulator and the manipulated lose their distinction. However, scientists rarely consider the entirety of the universe as an object of investigation. In most cases the scientist carves a piece from the universe and proclaims that piece in – namely, the focus of investigation. The rest of the universe is then considered out or background and is summarized by what we call boundary conditions. This choice of ins and outs creates asymmetry in the way we look at things, and it is this asymmetry that permits us to talk about “outside intervention” and hence about causality and cause-effect directionality.

Futarchy implements evidential decision theory

Futarchy is a meta-algorithm for making decisions using a given set of traders. For every possible action a, the beliefs of these traders are aggregated using a prediction market for that action, which, if a is actually taken, evaluates to an amount of money that is proportional to how much utility is received. If a is not taken, the market is not evaluated, all trades are reverted, and everyone keeps their original assets. The idea is that – after some learning and after bad traders lose most of their money to competent ones – the market price for a will come to represent the expected utility of taking that action. Futarchy then takes the action whose market price is highest.
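The decision rule and the settlement rule described above can be sketched in a few lines. This is a deliberately simplified illustration rather than a description of any real futarchy implementation: the Market class, the representation of trades as (trader, quantity, price paid) tuples, and the choice of paying out exactly one unit of money per unit of utility are all assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Market:
    action: str
    price: float  # current price, read as the market's estimate of utility given the action
    trades: list = field(default_factory=list)  # (trader, quantity, price_paid); negative = sold

def decide(markets):
    """Futarchy takes the action whose market price is highest."""
    return max(markets, key=lambda m: m.price).action

def settle(markets, chosen_action, realized_utility):
    """Net profit per trader: the chosen market pays out, all other trades are reverted."""
    profits = {}
    for market in markets:
        for trader, quantity, price_paid in market.trades:
            if market.action == chosen_action:
                gain = quantity * (realized_utility - price_paid)
            else:
                gain = 0.0  # reverted trade: the trader keeps their original assets
            profits[trader] = profits.get(trader, 0.0) + gain
    return profits

markets = [
    Market("a", 3.0, [("alice", 2, 2.5), ("bob", -2, 2.5)]),
    Market("b", 2.0, [("carol", 1, 2.0)]),
]
chosen = decide(markets)
print(chosen, settle(markets, chosen, realized_utility=4.0))
```

Here alice bought contracts on action "a" below the realized utility and profits, bob took the other side and loses the same amount, and carol's trade on the untaken action "b" has no effect.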

For a more detailed description, see, e.g., Hanson’s (2007) original paper on the futarchy, which also discusses potential objections. For instance, what happens in markets for actions that are very unlikely to be chosen? Note, however, that for this blog post you’ll only need to understand the basic concept and none of the minutiae of real-world implementation. The above description deliberately ignores and abstracts away from these. One example of such a discrepancy between standard descriptions of futarchy and my above account is that, in real-world governance, there is often a “default action” (such as, leave law and government as is). To keep the number of markets small, markets are set up to evaluate proposed changes relative to that default (such as the introduction of a new law) rather than simply for all possible actions. I should also note that I only know basic economics and am not an expert on the futarchy.

Traditionally, the futarchy has been thought of as a decision-making procedure for governance of human organizations. But in principle, AIs could be built on futarchies as well. Of course, many approaches to AI (such as most Deep Learning-based ones) already have all their knowledge concentrated into a single entity and thus don’t need any procedure (such as democracy’s voting or futarchy’s markets) to aggregate the beliefs of multiple entities. However, it has also been proposed that intelligence arises from the interaction and sometimes competition of a large number of simple subagents – see, for instance, Minsky’s book The Society of Mind, Dennett’s Consciousness Explained, and the modularity of mind hypothesis. Prediction markets and futarchies would be approaches to (or models of) combining the opinions of many of these agents, though I doubt that the human mind functions like either of the two. A theoretical example of the use of prediction markets in AI is MIRI’s logical induction paper. Furthermore, markets are generally similar to evolutionary algorithms.1

So, if we implement a futarchy-like system in an AI, what decision theory would that AI come to implement? It seems that the answer is EDT. Consider Newcomb’s problem as an example. Traders that predict one-boxing to yield a million and two-boxing to yield a thousand will earn money, since the agent will, in fact, receive a million if it one-boxes and a thousand if it two-boxes. More generally, the futarchy rewards traders based on how accurately they predict what is actually going to happen if the agent makes a particular choice. This leads the traders to estimate the value of an action as proportional to the expected utility conditional on that action since conditional probabilities are the correct way to make predictions.
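The way the futarchy rewards accurate conditional predictions can be illustrated with a toy model of Newcomb's problem. This is only a sketch under strong assumptions that are not part of the argument above: the predictor is perfectly reliable, there are just two traders with fixed beliefs, the market price is modeled as a wealth-weighted average of the traders' estimates, and settlement is replaced by a multiplicative wealth penalty proportional to the prediction error on the action actually taken. All names are hypothetical.

```python
ACTIONS = ("one-box", "two-box")

# Fixed beliefs about the payoff of each action (illustrative numbers).
BELIEFS = {
    # Predicts what will actually happen conditional on each action.
    "edt_trader": {"one-box": 1_000_000, "two-box": 1_000},
    # Treats box B as full with probability 0.5 regardless of the action.
    "cdt_trader": {"one-box": 500_000, "two-box": 501_000},
}

def newcomb_payoff(action):
    # Assumes a perfectly reliable predictor.
    return 1_000_000 if action == "one-box" else 1_000

def run_futarchy(rounds=40, wealth=None, learning_rate=0.5):
    wealth = dict(wealth or {"edt_trader": 1.0, "cdt_trader": 9_999.0})
    choices = []
    for _ in range(rounds):
        total = sum(wealth.values())
        # Stand-in for the market: wealth-weighted average of the traders' estimates.
        prices = {a: sum(wealth[t] * BELIEFS[t][a] for t in BELIEFS) / total
                  for a in ACTIONS}
        action = max(ACTIONS, key=prices.get)
        outcome = newcomb_payoff(action)
        # Only the market for the action actually taken is evaluated.
        for t in BELIEFS:
            relative_error = abs(BELIEFS[t][action] - outcome) / 1_000_000
            wealth[t] *= 1 - learning_rate * relative_error
        choices.append(action)
    return choices, wealth

choices, wealth = run_futarchy()
print(choices)
print(wealth)
```

In this sketch the causalist trader starts with almost all of the capital, so the system two-boxes at first. Because that trader's predictions for the taken action keep turning out wrong, it loses capital round after round, and the system switches to one-boxing once the evidential trader's share is large enough.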

There are some caveats, though. For instance, prediction markets only work if the question at hand can eventually be answered. Otherwise, the market cannot be evaluated. For instance, in Newcomb’s problem, one would usually assume that your winnings are eventually given and thus shown to you. But other versions of Newcomb’s problem are conceivable. For instance, if you are a consequentialist, Omega could donate your winnings to your favorite charity in such a way that you will never be able to tell how much utility this has generated for you. Unless you simply make estimates – in which case the behavior of the markets depends primarily on what kind of expected value (regular or causal) you will use as an estimate –, you cannot set up a prediction market for this problem at all. An example of such a “hidden” Newcomb problem is cooperation via correlated decision making between distant agents.

Another unaddressed issue is whether the futarchy can deal correctly with other problems of space-time embedded intelligence, such as the BPB problem.

Notwithstanding the caveats, EDT seems to be inherent in the way the futarchy works. To get the futarchy to implement CDT, it would have to reward traders based on what the agent is causally responsible for or based on some untestable counterfactual (“what would have happened if I had two-boxed”). Whereas EDT arises naturally from the principles of the futarchy, other decision theories require modification and explicit specification.

I should mention that this post is not primarily intended as a futarchist argument for EDT. Most readers will already be familiar with the underlying pro-EDT argument, i.e., EDT making decisions based on what will actually happen if a particular decision is made. In fact, it may also be viewed as a causalist argument against the futarchy.2 Rather than either of these two, it is a small part of the answer to the “implementation problem of decision theory”, which is: if you want to create an AI that behaves in accordance with some particular decision theory, how should that AI be designed? Or, conversely, if you build an AI without explicitly implementing a specific decision theory, what kind of behavior (EDT or CDT or other) results from it?


1. There is some literature comparing the way markets function to evolution-like selection (see the first section of Blume and Easley 1992) – i.e., how irrational traders are weeded out and rational traders accrue more and more capital. I haven’t read much of that literature, but the main differences between the futarchy and evolutionary algorithms seem to be the following. First, the futarchy doesn’t specify how new traders are generated, because it classically relies on humans to do the betting (and the creation of new automated trading systems), whereas this is a central concern in evolutionary algorithms. Second, futarchies permanently leave the power in the hands of many algorithms, whereas evolutionary algorithms eventually settle for one. This also means that the individual traders in a futarchy can be permanently narrow and specialized. For instance, there could be traders who exploit a single pattern and rarely bet at all. I wonder whether it makes sense to combine evolutionary algorithms and prediction markets. 

2. Probably futarchist governments wouldn’t face sufficiently many Newcomb-like situations in which the payoff can be tested for the difference to be relevant (see chapter 4 of Arif Ahmed’s Evidence, Decision and Causality).

A behaviorist approach to building phenomenological bridges

A few weeks ago, I wrote about the BPB problem and how it poses a problem for classical/non-logical decision theories. In my post, I briefly mentioned a behaviorist approach to BPB, only to immediately discard it:

One might think that one could map between physical processes and algorithms on a pragmatic or functional basis. That is, one could say that a physical process A implements a program p to the extent that the results of A correlate with the output of p. I think this idea goes into the right direction and we will later see an implementation of this pragmatic approach that does away with naturalized induction. However, it feels inappropriate as a solution to BPB. The main problem is that two processes can correlate in their output without having similar subjective experiences. For instance, it is easy to show that Merge sort and Insertion sort have the same output for any given input, even though they have very different “subjective experiences”.

Since writing the post I became more optimistic about this approach because the counterarguments I mentioned aren’t particularly persuasive. The core of the idea is the following: Let A and B be parameterless algorithms1. We’ll say that A and B are equivalent if we believe that A outputs x iff B outputs x. In the context of BPB, your current decision is an algorithm A and we’ll say B is an instance or implementation of A/you iff A and B are equivalent. In the following sections, I will discuss this approach in more detail.

You still need interpretations

The definition only solves one part of the BPB problem: specifying equivalence between algorithms. This would solve BPB if all agents were bots (rather than parts of a bot or collections of bots) in Soares and Fallenstein’s Botworld 1.0. But in a world without any Cartesian boundaries, one still has to map parts of the environment to parameterless algorithms. This could, for instance, be a function from histories of the world onto the output set of the algorithm. For example, if one’s set of possible world models is a set of cellular automata (CA) with various different initial conditions and one’s notion of an algorithm is something operating on natural numbers, then such an interpretation i would be a function from CA histories to the set of natural numbers. Relative to i, a CA with initial conditions contains an instance of algorithm A if A outputs x <=> i(H)=x, where H is a random variable representing the history created by that CA. So, intuitively, i is reading A’s output off from a description of the world. For example, it may look at the physical signals sent by a robot’s microprocessor to a motor and convert these into the output alphabet of A. E.g., it may convert a signal that causes a robot’s wheels to spin to something like “forward”. Every interpretation i is a separate instance of A.

Joke interpretations

Since we still need interpretations, we still have the problem of “joke interpretations” (Drescher 2006, sect. 2.3; also see this Brian Tomasik essay and references therein). In particular, you could have an interpretation i that does most of the work, so that the equivalence of A and i(H) is the result of i rather than the CA doing something resembling A.
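The difference between an interpretation that reads A's output off the world and a joke interpretation can be made concrete with a small example. This is only an illustrative sketch: the choice of elementary CA rule 90, the particular algorithm A and both interpretation functions are made up for this purpose. It relies on the fact that rule 90, started from a single live cell, has 2^k live cells in row t, where k is the number of 1 bits in the binary representation of t.

```python
def rule90_history(steps):
    """History H of elementary CA rule 90 started from a single live cell."""
    width = 2 * steps + 1
    row = [0] * width
    row[steps] = 1
    history = [row]
    for _ in range(steps):
        prev = history[-1]
        # Rule 90: each cell becomes the XOR of its two neighbors (cells beyond the edge are 0).
        row = [
            (prev[i - 1] if i > 0 else 0) ^ (prev[i + 1] if i < width - 1 else 0)
            for i in range(width)
        ]
        history.append(row)
    return history

def A():
    """Parameterless algorithm: 2 to the power of the number of 1 bits in 5."""
    return 2 ** bin(5).count("1")

def i_read(history):
    """Interpretation that reads a number off the world: the live cells in row 5."""
    return sum(history[5])

def i_joke(history):
    """Joke interpretation: ignores the world and recomputes A itself."""
    return 2 ** bin(5).count("1")

H = rule90_history(5)
blank = [[0] * 11 for _ in range(6)]  # a world in which nothing happens

print("A outputs", A())
print("rule 90 world:", i_read(H), i_joke(H))
print("blank world:  ", i_read(blank), i_joke(blank))
```

Both interpretations agree with A on the rule 90 history, but only the joke interpretation also finds an "instance" of A in the blank world, because there the interpretation itself is doing all of the work.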

I don’t think it’s necessarily a problem that an EDT agent might optimize its action too much for the possibility of being a joke instantiation, because it gives all its copies in a world equal weight no matter which copy it believes itself to be. As an example, imagine that there is a possible world in which joke interpretations lead you to identify with a rock. If the rock’s “behavior” does have a significant influence on the world and the output of your algorithm correlates strongly with it, then I see no problem with taking the rock into account. At least, that is what EDT would do anyway if it has a regular copy in that world.2 If the rock has little impact on the world, EDT wouldn’t care much about the possibility of being the rock. In fact, if the world also contains a strongly correlated non-instance3 of you that faces a real decision problem, then the rock joke interpretation would merely lead you to optimize for the action of that non-copy.

If you allow all joke interpretations, then you would view yourself in all worlds. Thus, the view may have similar implications as the l-zombie view where the joke interpretations serve as the l-zombies.4 Unless we’re trying to metaphysically justify the l-zombie view, this is not what we’re looking for. So, we may want to remove “joke interpretations” in some way. One idea could be to limit the interpretation’s computational power (Aaronson 2011, sect. 6). My understanding is that this is what people in CA theory use to define the notion of implementing an algorithm in a CA, see, e.g., Cook (2004, sect. 2). Another idea would be to include only interpretations that you yourself (or A itself) “can easily predict or understand”. Assuming that A doesn’t know its own output already, this means that i cannot do most of the work necessary to entangle A with i(H). (For a similar point, cf. Bishop 2004, sect. “Objection 1: Hofstadter, ‘This is not science’”.) For example, if i would just compute A without looking at H, then A couldn’t predict i very well if it cannot predict itself. If, on the other hand, i reads off the result of A from a computer screen in H, then A would be able to predict i’s behavior for every instance of H. Brian Tomasik lists a few more criteria to judge interpretations by.

Introspective discernibility

In my original rejection of the behaviorist approach, I made an argument about two sorting algorithms which always compute the same result but have different “subjective experiences”. I assumed that a similar problem could occur when comparing two equivalent decision-making procedures with different subjective experiences. But now I actually think that the behaviorist approach nicely aligns with what one might call introspective discernibility of experiences.

Let’s say I’m an agent that has, as a component, a sorting algorithm. Now, a world model may contain an agent that is just like me except that it uses a different sorting algorithm. Does that agent count as an instantiation of me? Well, that depends on whether I can introspectively discern which sorting algorithm I use. If I can, then I could let my output depend on the content of the sorting algorithm. And if I do that, then the equivalence between me and that other agent breaks. E.g., if I decide to output an explanation of my sorting algorithm, then my output would explain, say, bubble sort, whereas the other algorithm’s output would explain, say, merge sort. If, on the other hand, I don’t have introspective access to my sorting algorithm, then the code of the sorting algorithm cannot affect my output. Thus, the behaviorist view would interpret the other agent as an instantiation of me (as long as, of course, it, too, doesn’t have introspective access to its sorting algorithm). This conforms with the intuition that which kind of sorting algorithm I use is not part of my subjective experience. I find this natural relation to introspective discernibility very appealing.
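The sorting example can be sketched in code. This is a toy illustration only: the agents, their decision task and the way "introspective access" is modeled (the agent may include the name of its sorting routine in its output) are assumptions, and equivalence is checked here by simply running both agents rather than by consulting anyone's beliefs about their outputs.

```python
def bubble_sort(xs):
    xs = list(xs)
    for end in range(len(xs) - 1, 0, -1):
        for i in range(end):
            if xs[i] > xs[i + 1]:
                xs[i], xs[i + 1] = xs[i + 1], xs[i]
    return xs

def merge_sort(xs):
    if len(xs) <= 1:
        return list(xs)
    mid = len(xs) // 2
    left, right = merge_sort(xs[:mid]), merge_sort(xs[mid:])
    merged = []
    while left and right:
        merged.append(left.pop(0) if left[0] <= right[0] else right.pop(0))
    return merged + left + right

def make_agent(sort, introspective):
    """A parameterless agent that ranks some options using the given sorting routine."""
    def agent():
        best_option = sort([3, 1, 2])[-1]
        if introspective:
            # The agent can discern, and report, which sorting routine it runs.
            return (best_option, sort.__name__)
        return best_option
    return agent

def equivalent(a, b):
    """Behaviorist equivalence, checked directly: A outputs x iff B outputs x."""
    return a() == b()

print(equivalent(make_agent(bubble_sort, False), make_agent(merge_sort, False)))
print(equivalent(make_agent(bubble_sort, True), make_agent(merge_sort, True)))
```

Without introspective access the two agents produce the same output and count as instances of each other. Once the output can depend on the identity of the sorting routine, the equivalence breaks.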

That said, things are complicated by the equivalence relation being subjective. If you already know what A and B output, then they are equivalent if their output is the same — even if it is “coincidentally” so, i.e., if they perform completely unrelated computations. Of course, a decision algorithm will rarely know its own output in advance. So, this extreme case is probably rare. However, it is plausible that an algorithm’s knowledge about its own behavior excludes some conditional policies. For example, consider a case like Conitzer’s (2016, 2017), in which copies of an EU-maximizing agent face different but symmetric information. Depending on what the agent knows about its algorithm, it may view all the copies as equivalent or not. If it has relatively little self-knowledge, it could reason that if it lets its action depend on the information, the copies’ behavior would diverge. With more self-knowledge, on the other hand, it could reason that, because it is an EU maximizer and because the copies are in symmetric situations, its action will be the same no matter the information received.5

Consciousness

The BPB problem resembles the problem of consciousness: the question “does some physical system implement my algorithm?” is similar to the question “does some physical system have the conscious experience that I am having?”. For now, I don’t want to go too much into the relation between the two problems. But if we suppose that the two problems are connected, we can draw from the philosophy of mind to discuss our approach to BPB.

In particular, I expect that a common objection to the behaviorist approach will be that most instantiations in the behaviorist sense are behavioral p-zombies. That is, their output behavior is equivalent to the algorithm’s but they compute the output in a different way, and in particular in a way that doesn’t seem to give rise to conscious (or subjective) experiences. While the behaviorist view may lead us to identify with such a p-zombie, we can be certain, so the argument goes, that we are not such p-zombies, given that we have conscious experiences.

Some particular examples include:

  • Lookup table-based agents
  • Messed up causal structures, e.g. Paul Durham’s experiments with his whole brain emulation in Greg Egan’s novel Permutation City.

I personally don’t find these arguments particularly convincing because I favor Dennett’s and Brian Tomasik’s eliminativist view on consciousness. That said, it’s not clear whether eliminativism would imply anything other than relativism/anti-realism for the BPB problem (if we view BPB and philosophy of mind as sufficiently strongly related).


1. I use the word “algorithm” in a very broad sense. I don’t mean to imply Turing computability. In fact, I think any explicit formal specification of the form “f()=…” should work for the purpose of the present definition. Perhaps, even implicit specifications of the output would work. 

2. Of course, I see how someone would find this counterintuitive. However, I suspect that this is primarily because the rock example triggers absurdity heuristics and because it is hard to imagine a situation in which you believe that your decision algorithm is strongly correlated with whether, say, some rock causes an avalanche. 

3. Although the behaviorist view defines the instance-of-me property via correlation, there can still be correlated physical subsystems that are not viewed as an instance of me. In particular, if you strongly limit the set of allowed interpretations (see the next paragraph), then the potential relationship between your own and the system’s action may be too complicated to be expressed as A outputs x <=> i(H)=x.

4. I suspect that the two might differ in medical or “common cause” Newcomb-like problems like the coin flip creation problem.

5. If this is undesirable, one may try to use logical counterfactuals to find out whether B also “would have” done the same as A if A had behaved differently. However, I’m very skeptical of logical counterfactuals in general. Cf. the “Counterfactual Robustness” section in Tomasik’s post.