The lack of performance metrics for CDT versus EDT, etc.

(This post assumes that the reader is fairly familiar with the decision theory of Newcomb-like problems. Schwarz makes many of the same points in his post “On Functional Decision Theory” (though I disagree with him on other things, such as whether to one-box or two-box in Newcomb’s problem). Similar points have also been made many times about the concept of updatelessness in particular, e.g., see Section 7.3.3 of Arif Ahmed’s book “Evidence, Decision and Causality”, my post on updatelessness from a long time ago, or Sylvester Kollin’s “Understanding updatelessness in the context of EDT and CDT”. Preston Greene, on the other hand, argues explicitly for a view opposite of the one in this post in his paper “Success-First Decision Theories”.)

I sometimes read the claim that one decision theory “outperforms” some other decision theory (in general or in a particular problem). For example, Yudkowsky and Soares (2017) write: “FDT agents attain high utility in a host of decision problems that have historically proven challenging to CDT and EDT: FDT outperforms CDT in Newcomb’s problem; EDT in the smoking lesion problem; and both in Parfit’s hitchhiker problem.” Others use some variations of this framing (“dominance”, “winning”, etc.), some of which I find less dubious because they have less formal connotations.

Based on typical usage, these words make it seem as though there were some agreed-upon or objective metric for comparing decision theories on any particular problem and that MIRI is claiming to have found a theory that is better according to that metric (in some given problems). This would be similar to how one might say that one machine learning algorithm outperforms another on the CIFAR dataset, where everyone agrees that ML algorithms are better if they correctly classify a higher percentage of the images, require less computation time, need fewer samples during training, etc.

However, there is no agreed-upon metric to compare decision theories, no way to assess, even for a particular problem, whether one decision theory (or its recommendation) does better than another. (This is why the CDT-versus-EDT-versus-other debate is at least partly a philosophical one.) In fact, it seems plausible that finding such a metric is “decision theory-complete” (to butcher another term with a specific meaning in computer science). By that I mean that settling on a metric is probably just as hard as settling on a decision theory and that mapping between plausible metrics and plausible decision theories is fairly easy.

For illustration, consider Newcomb’s problem and a few different metrics. One possible metric is what one might call the causal metric, which is the expected payoff if we were to replace the agent’s action with action X by some intervention from the outside. Then, for example, in Newcomb’s problem, two-boxing “performs” better than one-boxing and CDT “outperforms” FDT. I expect that many causal decision theorists would view something of this ilk as the right metric and that CDT’s recommendations are optimal according to the causal metric in a broad class of decision problems.

A second possible metric is the evidential one: given that I observe that the agent uses decision theory X (or takes action Y) in some given situation, how big of a payoff do I expect the agent to receive. This metric directly favors EDT in Newcomb’s problem, the smoking lesion, and again a broad class of decision problems.
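To make the contrast between the causal and the evidential metric concrete, here is a minimal sketch in Python. The payoffs (1,000,000 and 1,000), the predictor accuracy, and all function names are illustrative assumptions rather than anything fixed by the discussion above. The causal metric holds the probability that the opaque box is full fixed while the action is imposed from outside; the evidential metric conditions that probability on the observed action.

```python
# Illustrative payoffs for Newcomb's problem (assumed values).
OPAQUE_PRIZE = 1_000_000   # in the opaque box iff one-boxing was predicted
TRANSPARENT_PRIZE = 1_000  # always in the transparent box


def payoff(action: str, opaque_box_full: bool) -> int:
    """Money received for an action, given the already fixed box contents."""
    total = OPAQUE_PRIZE if opaque_box_full else 0
    if action == "two-box":
        total += TRANSPARENT_PRIZE
    return total


def causal_metric(action: str, prob_full: float) -> float:
    """Expected payoff when the action is set by an outside intervention.

    The intervention does not change the box contents, so prob_full is the
    same whichever action is imposed.
    """
    return prob_full * payoff(action, True) + (1 - prob_full) * payoff(action, False)


def evidential_metric(action: str, accuracy: float) -> float:
    """Expected payoff conditional on observing the agent take the action.

    Observing the action is evidence about the prediction: the box is full
    with probability `accuracy` after one-boxing and `1 - accuracy` after
    two-boxing.
    """
    prob_full = accuracy if action == "one-box" else 1 - accuracy
    return prob_full * payoff(action, True) + (1 - prob_full) * payoff(action, False)


if __name__ == "__main__":
    for action in ("one-box", "two-box"):
        print(action, causal_metric(action, 0.5), evidential_metric(action, 0.99))
```

Under these assumptions, two-boxing scores exactly 1,000 higher on the causal metric for every value of `prob_full`, while one-boxing scores higher on the evidential metric whenever the predictor is reasonably accurate. The same problem yields opposite rankings depending on which metric is chosen.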

A third possibility is a modification of the causal metric. Rather than replacing the agent’s decision, we replace its entire decision algorithm before the predictor looks at and creates a model of the agent. Despite being causal, this modification favors decision theories that recommend one-boxing in Newcomb’s problem. In general, the theory that seems to maximize this metric is some kind of updateless CDT (cf. Fisher’s disposition-based decision theory). 

Yet another causalist metric involves replacing from the outside the decisions of not only the agent itself but also of all agents that use the same decision procedure. Perhaps this leads to Timeless Decision Theory or Wolfgang Spohn’s proposal for causalist one-boxing.

One could also use the notion of regret (as discussed in the literature on multi-armed bandit problems) as a performance measure, which probably leads to ratificationism.

Lastly, I want to bring up what might be the most commonly used class of metrics: intuitions of individual people. Of course, since intuitions vary between different people, intuition provides no agreed upon metric. It does, however, provide a non-vacuous (albeit in itself weak) justification for decision theories. Whereas it seems unhelpful to defend CDT on the basis that it outperforms other decision theories according to the causal metric but is outperformed by EDT according to the evidential metric, it is interesting to consider which of, say, EDT’s and CDT’s recommendations seem intuitively correct.

Given that finding the right metric for decision theory is similar to the problem of decision theory itself, it seems odd to use words like “outperforms” which suggest the existence or assumption of a metric.

I’ll end with a few disclaimers and clarifications. First, I don’t want to discourage looking into metrics and desiderata for decision theories. I think it’s unlikely that this approach to discussing decision theory can resolve disagreements between the different camps, but that’s true for all approaches to discussing decision theory that I know of. (An interesting formal desideratum that doesn’t trivially relate to decision theories is discussed in my blog post Decision Theory and the Irrelevance of Impossible Outcomes. At its core, it’s not really about “performance measures”, though.)

Second, I don’t claim that the main conceptual point of this post is new to, say, Nate Soares or Eliezer Yudkowsky. In fact, they have written similar things; see, for instance, Ch. 13 of Yudkowsky’s Timeless Decision Theory, in which he argues that decision theories are untestable because counterfactuals are untestable. Even in the aforementioned paper, claims about outperforming are occasionally qualified. E.g., Yudkowsky and Soares (2017, sect. 10) say that they “do not yet know […] (on a formal level) what optimality consists in”. Unfortunately, most outperformance claims remain unqualified. The metric is never specified formally or discussed much. The short verbal descriptions that are given make it hard to understand how their metric differs from the metrics corresponding to updateless CDT or updateless EDT.

So, my complaint is not so much about these authors’ views but about a Motte and Bailey-type inconsistency, in which the takeaways from reading the paper superficially are much stronger than the takeaways from reading the whole paper in-depth and paying attention to all the details and qualifications. I’m worried that the paper gives many casual readers the wrong impression. For example, gullible non-experts might get the impression that decision theory is like ML in that it is about finding algorithms that perform as well as possible according to some agreed-upon benchmarks. Uncharitable, sophisticated skim-readers may view MIRI’s positions as naive or confused about the nature of decision theory.

In my view, the lack of an agreed-upon performance measure is an important fact about the nature of decision theory research. Nonetheless, I think that, e.g., MIRI is doing and has done very valuable work on decision theory. More generally I suspect that being wrong or imprecise about this issue (that is, about the lack of performance metrics in the decision theory of Newcomb-like problems) is probably not an obstacle to having good object-level ideas. (Similarly, while I’m not a moral realist, I think being a moral realist is not necessarily an obstacle to saying interesting things about morality.)


This post is largely inspired by conversations with Johannes Treutlein. I also thank Emery Cooper for helpful comments.

The Stag Hunt against a similar opponent

[I assume that the reader is familiar with Newcomb’s problem and the Prisoner’s Dilemma against a similar opponent and ideally the equilibrium selection problem in game theory.]

The trust dilemma (a.k.a. Stag Hunt) is a game with a payoff matrix kind of like the following:


Its defining characteristic is the following: The Pareto-dominant outcome (i.e., the outcome that is best for both players), (S,S), is a Nash equilibrium. However, (H,H) is also a Nash equilibrium. Moreover, if you’re sufficiently unsure what your opponent is going to do, then H is the best response. If two agents learn to play this game and they start out playing the game at random, then they are more likely to converge to (H,H). Overall, we would like it if the two agents played (S,S), but I don’t think we can assume this to happen by default.
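The properties just listed can be checked mechanically. The sketch below is in Python and uses hypothetical payoffs that I chose only because they have the stag-hunt structure; they are not the numbers from the original matrix, and the function names are likewise illustrative. It enumerates the pure-strategy Nash equilibria and computes the best response against an opponent who plays S with a given probability.

```python
from itertools import product

ACTIONS = ("S", "H")

# Hypothetical payoffs (row player, column player); an assumption for
# illustration, not the matrix from the post.
PAYOFFS = {
    ("S", "S"): (4, 4),
    ("S", "H"): (0, 3),
    ("H", "S"): (3, 0),
    ("H", "H"): (3, 3),
}


def pure_nash_equilibria(payoffs):
    """Return action pairs from which neither player gains by deviating alone."""
    equilibria = []
    for a, b in product(ACTIONS, repeat=2):
        row_ok = all(payoffs[(a, b)][0] >= payoffs[(alt, b)][0] for alt in ACTIONS)
        col_ok = all(payoffs[(a, b)][1] >= payoffs[(a, alt)][1] for alt in ACTIONS)
        if row_ok and col_ok:
            equilibria.append((a, b))
    return equilibria


def expected_payoff(action, prob_opponent_stag, payoffs):
    """Row player's expected payoff if the opponent plays S with the given probability."""
    return (prob_opponent_stag * payoffs[(action, "S")][0]
            + (1 - prob_opponent_stag) * payoffs[(action, "H")][0])


def best_response(prob_opponent_stag, payoffs):
    """Action with the highest expected payoff against the assumed opponent mix."""
    return max(ACTIONS, key=lambda a: expected_payoff(a, prob_opponent_stag, payoffs))


if __name__ == "__main__":
    print(pure_nash_equilibria(PAYOFFS))
    print(best_response(0.5, PAYOFFS), best_response(0.9, PAYOFFS))
```

With these assumed numbers, both (S,S) and (H,H) come out as equilibria, and H is the best response unless you are fairly confident (here, more than 75%) that the opponent plays S.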

Now what if you played the trust dilemma against a similar opponent (specifically one that is similar w.r.t. how they play games like the trust dilemma)? Clearly, if you play against an exact copy, then by the reasoning behind cooperating in the Prisoner’s Dilemma against a copy, you should play S. More generally, it seems that a similarity between you and your opponent should push towards trusting that if you play S, the opponent will also play S. The more similar you and your opponent are, the more you might reason that the decision is mostly between (S,S) and (H,H) and the less relevant are (S,H) and (H,S).

What if you played against an opponent who knows you very well and who has time to predict how you will choose in the trust dilemma? Clearly, if you play against an opponent who can perfectly predict you (e.g., because you are an AI system and they have a copy of your source code), then by the reasoning behind one-boxing in Newcomb’s problem, you should play S. More generally, the more you trust your opponent’s ability to predict what you do, the more you should trust that if you play S, the opponent will also play S.
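One crude way to put numbers on "the more similar (or the better predicted), the more you should trust S" is the following Python sketch. Everything in it is an assumption for illustration: the payoffs, the mixture model, and the parameter names `match_prob` and `base_rate_stag`. With probability `match_prob` the opponent's action equals yours (because they are similar to you or predicted you correctly); otherwise they play S at some base rate regardless of what you do. The expected payoff is computed conditional on your own action, which is the one-boxing style of reasoning described above.

```python
# Hypothetical payoffs to "you" for (your action, opponent's action).
PAYOFF = {("S", "S"): 4, ("S", "H"): 0, ("H", "S"): 3, ("H", "H"): 3}


def conditional_expected_payoff(my_action, match_prob, base_rate_stag=0.5):
    """Expected payoff conditional on my action under a simple mixture model.

    With probability match_prob the opponent plays the same action as I do;
    otherwise they play S with probability base_rate_stag, independently of me.
    """
    independent = (base_rate_stag * PAYOFF[(my_action, "S")]
                   + (1 - base_rate_stag) * PAYOFF[(my_action, "H")])
    matched = PAYOFF[(my_action, my_action)]
    return match_prob * matched + (1 - match_prob) * independent


def preferred_action(match_prob, base_rate_stag=0.5):
    """Action with the higher conditional expected payoff."""
    return max(("S", "H"),
               key=lambda a: conditional_expected_payoff(a, match_prob, base_rate_stag))


if __name__ == "__main__":
    for m in (0.0, 0.2, 0.8, 1.0):
        print(m, preferred_action(m))
```

In this toy model, low values of `match_prob` favor H and high values favor S. Note that either recommendation is one of the two Nash equilibrium strategies, so the similarity only affects which equilibrium is selected.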

Here’s what I find intriguing about these scenarios. In these scenarios, one-boxers might systematically arrive at a different (more favorable) conclusion than two-boxers. However, this conclusion is still compatible with two-boxing, or with blindly applying Nash equilibrium. In the trust dilemma, one-boxing type reasoning merely affects how we resolve the equilibrium selection problem, which the orthodox theories generally leave open. This is in contrast to the traditional examples (Prisoner’s Dilemma, Newcomb’s problem) in which the two ways of reasoning are in conflict. So there is room for implications of one-boxing, even in an exclusively Nash equilibrium-based picture of strategic interactions.

AI and Ideal Theory

[Epistemic status: Non-careful speculation]

Political philosophy is full of idealized hypothetical scenarios in which rational agents negotiate and determine the basic constitution of society. I suggest that this kind of political philosophy is more relevant to AGI than it is to humans.

Traditional political theory involves things like the state of nature, the veil of ignorance, rational selfish expected utility maximizers, near-costless negotiations and communication, lots of common knowledge, perfect compliance to laws, and the lack of pre-existing precedents, conflicts, or structures constraining options. For examples of this, see Rawls’ A Theory of Justice, Harsanyi’s “Cardinal Welfare, Individualistic Ethics, and Interpersonal Comparisons of Utility,” or Buchanan’s The Calculus of Consent. These are the cleanest and most famous examples of what I’m talking about, but I’m sure there are more.

Here are some reasons to think this stuff is more relevant to AGI than to humans:

  • Humans are constrained by existing institutions and obligations; AGIs might not be.
  • AGIs are more likely to be rational in more ways than humans; in particular, they are more likely to behave like stereotypical expected utility maximizers.
  • AGIs are more likely to share lots of empirical beliefs and have lots of common knowledge.
    • They might be less prone to biases that entrench differences; they might be more epistemically rational, so their beliefs will converge to a much greater extent.
    • Their sheer size will mean they can ingest much the same information as each other–if they all read the whole internet, then that means they have the same information, whereas humans can only read a tiny portion of the internet.
  • AGIs are less likely to share values with each other; their interactions really will look more like a bargain between mutually disinterested agents and less like an attempt to convince each other or entreat each other for sympathy. Thus their negotiations will be closer to what Rawls and others imagine.
  • AGIs are less likely to be immutable black boxes; they are likely to be able to read and understand the code of other AIs and make modifications to their own code. This means that they can credibly and cheaply commit to binding agreements. For example, a group of AIs could literally build a Leviathan: a new AI that rules over all of them, and whose code they all agreed on. It’s an ideal-theory theorist’s wet dream: an omnipotent, omnibenevolent, immutably stable State.
  • AGIs might go “updateless” pretty early on, before they learn much about the nature of the other agents in the world, meaning that they might actually end up doing something like obeying the rules they would have agreed to from behind a veil of ignorance. Especially if they engage in multiverse-wide cooperation via superrationality.
  • AGIs might be programmed to carry out the wishes of hypothetical humans. For example, in the Coherent Extrapolated Volition proposal, the AGI does what we would have wanted it to do, “if we knew more, thought faster, were more the people we wished we were, had grown up farther together.”

Objection: It may be true that AIs are more likely than humans to find themselves in situations like Rawls’ Original Position. But such scenarios are still unlikely, even for AI. The value of this sort of ideal theory is not its predictive power, but rather its normative power: It tells us how we ought to organize our society. And for this purpose it doesn’t matter how likely the situation is to actually obtain–it is hypothetical.

Reply: Well, the ways in which these ideal situations differ from reality are often the basis of critiques of their normative relevance. Indeed I find such critiques compelling. To pick on Rawls, why should it matter to what we should do in 2019 USA what “we” would have agreed to behind a veil of ignorance that not only concealed from us our place in society but surgically removed our ethical views and intuitions, our general world-views, and our attitudes toward risk and made us into risk-averse egoists? My (pessimistic) claim is that if these scenarios have any value at all, it is their predictive value–and they are more likely to have that for AI than for humans.

That said, I think I may be too pessimistic. Recent developments in decision theory (updatelessness, MSR) suggest that something like one of these scenarios might be normatively important after all. Further research is needed. At any rate, I claim that they are more likely to be normatively important for AI (or for how we should design AI) than for us.

Grimdark Cyberkant

[Author’s note: Still uploading old papers. This one is rather unpolished, I wrote it for a class in a hurry and probably no longer endorse everything in it. Also, I had a bit too much fun with it. 🙂 But hey, might as well put it up; otherwise it’ll never see the light of day. Note that this paper tries to assume zero knowledge about formal epistemology or Kant in the reader, so parts of it may be tediously familiar.]

1. Introduction

“But they’re only uploads.” Pamela stares at him. “Software, right? You could reinstantiate them on another hardware platform, like, say, your Aineko. So the argument about killing them doesn’t really apply, does it?”

“So? We’re going to be uploading humans in a couple of years. I think we need to take a rain check on the utilitarian philosophy, before it bites us on the cerebral cortex. Lobsters, kittens, humans—it’s a slippery slope.”

–Quote from Accelerando by Charles Stross (1)

The goal of this paper is to investigate the many similarities between Kant’s Categorical Imperative and some recent developments in decision theory. I will argue that the similarities run surprisingly deep, and that as a result formal epistemologists and Kant scholars may have things to learn from each other.

The ideas in this paper are fresh and exciting, but I can’t take credit for most of them. Obviously Kant’s work is his own, and moreover the recent developments in formal epistemology aren’t mine either. (Oesterheld 2017) In fact, I’m not even the first to notice similarities between the two. (Tomasik 2015) However, no one to my knowledge has investigated how deep the comparisons go. My goal is to distill the key ideas and arguments from the literature and draw connections to Kant, so I’ll be using chains of English reasoning rather than derivations in formal models. Thus, this paper understates the rigor of these arguments and ideas; for a more thorough presentation, look to the literature.

2. Kant

This section gives a brief overview of Kant’s categorical imperative, to set up for later sections. There are three formulations of the Categorical Imperative:

2.1. First Formulation: “Act only according to that maxim whereby you can at the same time will that it should become a universal law.” (421) (2)

Kant thinks it is a necessary law for all rational beings always to judge their actions according to this imperative. (426) I take this to mean that obeying this imperative is a requirement of rationality; it is always irrational to disobey it.

2.2. Second Formulation: “Act in such a way that you treat humanity, whether in your own person or in the person of another, always at the same time as an end and never simply as a means.” (429)

Kant clarifies that strictly speaking it’s not just humans that should be treated this way, but all rational beings. He specifically says that this does not extend to non-rational beings. (428) Moreover, and crucially for my purposes, Kant clarifies that treating someone as an end means striving to further their ends. (430) It means more than that, as well, but I’ll leave that to the footnotes. (3)

2.3. Third Formulation: “Act in accordance with the maxims of a member legislating universal laws for a merely possible kingdom of ends.” (439)

This one takes some explaining. The idea is that, in keeping with the first formulation, each rational being should think of its actions as at the same time legislating universal laws that determine the conduct of all other rational beings. In keeping with the second formulation, these laws take into consideration the ends of all rational beings. So, metaphorically, we can think of all rational beings as both subjects and legislators in one big happy kingdom—the kingdom of ends.

2.4. Equivalence: Mysteriously, Kant seems to think all three formulations are equivalent: “The aforementioned three ways of representing the principle of morality are at bottom only so many formulas of the very same law: one of them by itself contains a combination of the other two.” (436) This is mysterious because the first two formulations certainly do not seem equivalent. One of my goals in this paper is to show how, if we take the formal epistemology interpretation seriously, we might be able to glimpse their equivalence.

2.5. Reason’s Common Principle: Finally, I follow scholars like O’Neill, Korsgaard, and Westphal in interpreting Kant as having a unified theory of reason, in the following sense: The Categorical Imperative is not merely a principle that applies to choosing actions; it also applies to choosing beliefs. (Williams 2018) This interpretation is controversial but there is at least one passage which directly supports it: “To make use of one’s own reason means no more than to ask oneself, whenever one is supposed to assume something, whether one could find it feasible to make the ground or the rule on which one assumes it into a universal principle for the use of reason.” (8:146n) I mention this because it is yet another parallel between Kant and recent formal epistemology work; more on this later.

3. Cyberkant

This section explains (my preferred version of) some recent developments in decision theory, presented in a way that highlights the comparisons to Kant.

3.1. The Beginning: Two copies of the same algorithm will behave in the same way. (This is analytic; if two things don’t behave the same way it’s because they are no longer copies of the same algorithm.) Moreover, even when the algorithms aren’t the same, they can be relevantly similar. For example, the algorithm instantiated in my computer (a PC) may be different from the algorithm instantiated in yours (a Mac) but if we each use our respective calculator utilities to calculate the square root of 120935, we’ll get the same answer. Another example: Suppose I design a chess-playing program, and then you take it and tweak it so that it has a better end-game, but you leave it otherwise the same. Each of the resulting two programs will behave the same way in the early game; I could use the behavior of mine to predict the behavior of yours and vice versa, even if there was no causal link between them, even if yours was instead built by aliens. So far this is uncontroversial.

3.2. The Fundamental Insight: What happens when you are an algorithm? Consider a very intelligent chess-playing algorithm, capable of deep reflection on itself and its place in the world. If it knows there may be other copies of itself out there, then it knows that, when it chooses an opening move, it is thereby choosing not only for itself but for all its copies; and when it chooses how to respond to a particular complicated mid-game position, it is thereby choosing how all of its copies will respond to that position; and so forth.

Many decision theorists have begun to argue that we should think of our decisions in this way. (4) Perhaps there aren’t any copies of us out there, but if there were, we should choose as if we were choosing for all of them. After all, it’s logically impossible for your copy to do something different from you, so your choice of what to do logically determines theirs.

So far this remains in the realm of science fiction, since there aren’t in fact any such copies of us. Bear with me for a little longer.

Suppose that you found yourself in a Prisoner’s Dilemma against a copy of yourself. If you both cooperate, you each get a payoff of utility 10. If you both defect, you each get a payoff of utility 1. If one cooperates and the other defects, the payoffs are 0 and 11 respectively.

According to what I’m calling “the fundamental insight,” the rational thing to do is cooperate, because if you cooperate the copy will also cooperate, and if you defect the copy will also defect. This is actually very controversial in decision theory; for many decades now the orthodox position has been Causal Decision Theory, which says that the rational thing to do in this situation is defect. (The rationale is intuitively compelling to many: Either the copy will defect, or they won’t. You have no control over them, since you are causally disconnected. And regardless of what they do you are better off defecting. So you should defect, by dominance reasoning.) Nevertheless, a growing literature has explored the you-should-cooperate option, and given various justifications and elaborations of the idea.
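The two lines of reasoning can be put side by side in a short Python sketch using the payoffs given above (10 for mutual cooperation, 1 for mutual defection, 0 and 11 for the cooperator and the defector respectively). The function names are mine and purely illustrative. Dominance reasoning compares actions while holding the copy's action fixed; the "fundamental insight" compares only the outcomes in which the copy does whatever you do.

```python
# Payoffs to "you" from the text: (your action, copy's action) -> utility.
PAYOFF = {("C", "C"): 10, ("D", "D"): 1, ("C", "D"): 0, ("D", "C"): 11}
ACTIONS = ("C", "D")


def dominance_choice():
    """Return the action that does strictly better against every fixed
    action of the other player, or None if no action dominates."""
    for a in ACTIONS:
        others = [b for b in ACTIONS if b != a]
        if all(PAYOFF[(a, opp)] > PAYOFF[(b, opp)] for b in others for opp in ACTIONS):
            return a
    return None


def copy_choice():
    """Choose as if the copy necessarily takes the same action as you,
    so only the outcomes (C, C) and (D, D) are compared."""
    return max(ACTIONS, key=lambda a: PAYOFF[(a, a)])


if __name__ == "__main__":
    print("dominance reasoning:", dominance_choice())
    print("choosing for all copies:", copy_choice())
```

Dominance reasoning selects D, since 11 > 10 and 1 > 0, whereas choosing for all copies selects C, since 10 > 1. The code does not settle which comparison is the right one; it only makes explicit that the disagreement is about which outcomes are treated as live options.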

3.3. Shades of Kant: The idea we arrived at above is that rationality recommends choosing as if you are choosing for all copies of yourself that happen to be in the same situation. The similarity to the first formulation of the Categorical Imperative is clear.

There is also a similarity to the Second and Third formulations, which can be brought out as follows: Consider what would happen if you think “What I’d like most is for me to choose Defect and my twin to choose Cooperate. Aha! I’ll make up my mind to choose the action that most benefits me, and thus both of us will choose the action that most benefits me, and so I’ll defect and my twin will cooperate!” Obviously, this won’t work. What will happen is that your twin will also think that way, and defect. For your twin, “me” refers to them, not you.

Extrapolating from this idea, there is no way to game the system: You should choose as if you are choosing for all copies of yourself that happen to be in the same situation, but if you choose in a way intended to benefit your utility function at the expense of theirs, it won’t work—you’ll end up worse off than if you had chosen in a way intended to benefit your combined utility function. So the way to get the most benefit for yourself is to choose in the way that benefits all your copies as well. You must treat their ends as your own, if you want them to treat your ends as their own. And you do want that, because there are gains from trade to be had: cooperate-cooperate is better than defect-defect.

3.4. Expanding the circle of concern:

The obvious, and perhaps the only, difference between Kant’s Categorical Imperative and the requirement of rationality we’ve sketched has to do with the scope of concern. Kant’s imperative commands choosing on behalf of all rational agents, and treating the ends of all rational agents as your own. This decision-theoretic imperative commands the same, but only for copies of yourself in similar situations.

But the circle can be expanded. Previously we mentioned the case of two chess-playing algorithms that are different, but relevantly similar. Analogously, we can imagine modifying the Twin Prisoner’s Dilemma scenario so that your twin is not exactly the same (but still relevantly similar) or so that your twin’s situation is not exactly the same (but still relevantly similar!). For example, suppose you value the wellbeing of animals and for that reason don’t eat meat; your twin is just like you except that they treat video game characters as you treat animals and vice versa. So by default you will eat lentils and play violent video games while they will eat bacon and play peaceful games. But, I claim, rationality recommends reasoning as follows: “It costs me little to switch to peaceful games, and my twin would greatly appreciate it. Similarly, it costs them little to switch to lentils, and I would greatly appreciate it. So we’d both be better off if we both switched. Since we are so similar in every other way besides this, they’re probably going through the same thought process right now… so if I switch, they probably will too, and if I decide not to switch, they probably won’t either. So I should switch.”

If this is right, then the circle expands not just to exact copies of yourself, but to clusters of relevantly-similar people. We’ll get people who may have radically different ends/utility functions/values deciding to adopt each others’ ends/utility functions/values as their own, forming little communities of mutual benefit. Villages of Ends!

(And as a reminder, all of this happens without any causal mechanism linking the various cooperators. What matters is that each person thinks the other is out there somewhere, not that they actually communicate and see each other’s behavior.)

How far does this “relevantly similar” take us? Well, remember, what’s really going on here is that (if you are rational) you will decide to cooperate with people such that you believe the following of them: They’ll cooperate with you iff you cooperate with them.

This means that if you are rational, you will decide to cooperate with people who are rational. Thinking of ourselves as analogous to chess algorithms can take us a long way towards Kant! (5) For convenience, henceforth I’ll refer to this decision-theoretic line of reasoning as “Cyberkant.”

3.5. Caveat: Power relations: Cyberkant sounds too good to be true, and it is. There’s a caveat which I did not mention until now. The sort of cooperation we’ve been talking about only happens when the agents are of roughly equal power—that is, when both sides are able to do something that costs them little but helps the other side out a lot. What happens when one side is unable to bring anything to the table, so to speak?

In that case, even if they are rational they will be left out of the bargain, because it won’t be true that “If I cooperate with them, they’ll cooperate with me.” (Alternatively, their cooperation with you isn’t valuable enough to outweigh the cost of cooperating with them.) Note that what matters is how they are perceived: Even if they can in fact bring lots to the table, if no one believes they can, the cooperation won’t occur. One corollary of this is that people who believe the universe is very small are unlikely to engage in much cooperation; cooperation depends on thinking that somewhere out there, someday, there might be someone in a position to do something you like, someone who you are currently in a position to benefit. This will be discussed further in later sections.

This is where the “Grimdark” comes in. As things currently stand, Cyberkantian cooperation is still not as universal as Kantian cooperation. Instead of a fairy-tale Kingdom of Ends in which every rational agent is both subject and legislator and all are treated equally, we have a truly medieval Kingdom of Ends, a stratified, classist society: the powerful cooperate with the powerful, the weak cooperate with the weak, those in between cooperate with those in between… and the irrational get left out entirely. For example, people who follow Causal Decision Theory must fend for themselves. Also, people who simply haven’t thought through what rationality demands of them, and thus aren’t even thinking about this sort of cooperation, are on their own.

4. Egalitarian reform in the kingdom

Let’s see what we can do to make our Cyberkantian kingdom of ends more egalitarian. The decision theorists who have proposed these ideas are considerably more optimistic than this paper has been so far about the number of pleasant consequences we can derive from this line of reasoning.

4.1. Slippery Slope: Should you cooperate with rational agents less powerful than yourself? By hypothesis the cost of cooperating with them outweighs the benefit of having them cooperate with you. Yet there is another benefit to consider: Rational agents who are more powerful than you may cooperate with you if and only if you cooperate with rational agents less powerful than you. This won’t work well if there are very few levels of power or if everyone knows quite a lot about where they stand. (6) However, now the universe is on our side, because realistically most agents won’t know exactly where they stand in the hierarchy. (7) So unless you are very confident that you are at the top of the food chain—unless you are very confident that you won’t ever be in a position of vulnerability, where someone else can help you greatly at little cost to themselves—you should treat all rational agents with beneficence.

Another way of putting this idea is as follows: If you are rational and yet you don’t deign to help rational people in great need even when there is nothing they (or anyone like them) can offer you in return, then people who are rational won’t deign to help you when you are in great need. Insofar as you think you aren’t invincible, this should convince you.

But what about people who do think they are invincible? Does the Cyberkantian categorical imperative have no authority over them?

4.2. Planning from the Original Position: There’s a separate route to getting a similar sort of result, and it complements the above nicely, because it is strong in precisely the cases where the “Slippery slope” solution is weak.

Most people who think they are powerful didn’t always think that. There was a time in the past when they weren’t so sure of their position. At that time, if they have the ability to bind themselves to follow a plan, they’ll bind themselves to follow the plan “Be nice to rational agents less powerful than me, even if I find out that I’m on top of the food chain.” They’ll do this because if they do it, then other rational agents will do it too, and so if they don’t end up on top they’ll still be treated nicely.

Of course, not everyone has the ability to bind themselves to a plan. But I think it’s fair to say that many of the most powerful agents do. (8) Moreover, there are some arguments in the literature (see e.g. Drescher 2006, Soares & Fallenstein 2015, Oesterheld 2017) that the rational thing to do just is following the plan that you would have bound yourself to at the beginning of your epistemic life.

I won’t get into this here, but I’ll try to motivate it briefly as follows: Rationality is already distinct from success. Someone who does the rational thing can still end up unluckily losing everything as a result, and someone who does the irrational thing can still end up winning as a result. Nevertheless rationality is still a useful concept. Why it is useful is an interesting question, but perhaps it has to do with praise and blame and system design. Now, if you disobey the plan that you would have bound yourself to at some previous time, you are in effect at war with yourself: Your different time-slices disagree about what a particular time-slice should do, and so e.g. your past self will pay to bind your future self. This is bad system design; it’s better to have a system that just does what it would have bound itself to do. (9) So perhaps it’s not crazy to suggest that, not only should we design systems to be temporally consistent, but we should praise and blame agents in this manner as well.

4.3. But what about irrational and immoral agents?

The previous two subsections attempted to unify the Cyberkantian Kingdom of Ends into something resembling Kant’s Kingdom of Ends. As we’ve seen, it’s difficult to extend the Kingdom to include the extremely powerful, though there are at least some lines of argument that point in that direction.

This subsection revisits a question which came up earlier: What about irrational and/or immoral agents? Are they locked out in the cold?

To some extent, the arguments in the previous two subsections can be extended to apply here as well. Perhaps we should think: “I’m not perfectly rational or moral myself. I should be nice to people who are less rational/moral than me, so that people who are more rational/moral than me will be nice to me.”

Clearly this line of reasoning can be taken too far, though. It would be foolish to be nice to bacteria, for example. Moreover it would be downright dangerous to be nice to evil people–by that I mean, evil by Cyberkantian standards, i.e. people who have contemplated cooperating with you but chose not to—because if you are nice to people who aren’t nice to you, then people won’t have an incentive to be nice to you. Indeed it is just as much a conclusion of Cyberkantian reasoning that you should turn the cold shoulder to people who knowingly disobey Cyberkantian reasoning as it is that you should cooperate with people who obey it.

I think it is an open question how far this reasoning extends. For now, until further investigation is complete, all we can say is that the strength of your decision-theoretic reasons to be nice to people varies proportionally to how rational they are and to how moral they are.

5. Not so different after all

I’ll be the first to admit that despite all these similarities there remain some differences between Kant and Cyberkant. (I cannot go against centuries of Kant scholarship!) However, as this section will argue, the two are not as different as they sound.

5.1. Corruption in the Kingdom of Ends

Previously we discussed the inegalitarian nature of the Cyberkantian kingdom, and how it can be ameliorated, but not without residual difficulty: Some people are invincible and they know it; why should they help anyone else? There are lines of reasoning that get even those people into the fold, but we could be forgiven for being skeptical.

I think that Kant is in a similar situation. What would Kant say to someone who knows themselves to be invincible? They would be perfectly happy to will their maxims to be universal laws; a world in which everyone goes around cheating and stealing and killing would be just fine for them, since they are invincible. Consider how Kant justifies our duty of beneficence:

A fourth man finds things going well for himself but sees others (whom he could help) struggling with great hardships; and he thinks: what does it matter to me? Let everybody be as happy as Heaven wills or as he can make himself; I shall take nothing from him nor even envy him; but I have no desire to contribute anything to his well-being or to his assistance when in need…. But even though it is possible that a universal law of nature could subsist in accordance with that maxim, still it is impossible to will that such a principle should hold everywhere as a law of nature. For a will which resolved in this way would contradict itself, inasmuch as cases might often arise in which one would have need of the love and sympathy of others and in which he would deprive himself, by such a law of nature springing from his own will, of all hope of the aid he wants for himself. (423, emphasis mine)

The similarity to the decision-theoretic justification is striking!

Of course, perhaps Kant was not trying to convince the invincible egoist. Perhaps Kant can work with a more substantive notion of reason that doesn’t include such people. But if this is the way we go, then the difference between Kant and Cyberkant (on this matter at least) is merely verbal and could be erased by importing a more substantive notion of reason into the Cyberkantian framework, to complement the thin notion it currently uses.

5.2. Think of the children, the immoral, and the causal decision theorists

Another difficulty with Cyberkant is that his Kingdom brutally excludes many people who, intuitively, ought to be included: Those who haven’t thought things through enough to contemplate the Categorical Imperative or the logical correlations between their behavior and that of others; those who have thought these things through but chose not to obey; and those who are in the grip of a false theory of rationality. Again, there are arguments in the Cyberkantian framework to extend the Kingdom to include such people, but we can be forgiven for being skeptical.

Kant, I claim, has this problem as well. He is abundantly clear that the Kingdom extends to include all and only the rational beings. (433, 434) Consider this: “A rational being belongs to the kingdom of ends as a member when he legislates in it universal laws while also being himself subject to these laws.” (433) Consider also: “Now morality is the condition under which alone a rational being can be an end in himself, for only thereby can he be a legislating member in the kingdom of ends.” (435)

I interpret these passages as saying that immoral and irrational beings are not in Kant’s Kingdom of Ends. I suggest that this is a charitable interpretation, because it makes the justification of the Categorical Imperative much less mysterious: Why should rational beings act as if irrational beings will behave similarly? (By contrast: The question “Why should rational beings act as if every rational being will behave similarly?” almost answers itself.) Why should moral beings take on the ends of immoral beings as their own? Don’t we want to incentivise moral behavior?

6. Formal Epistemology learns from Kant

Previously I mentioned how, for Kant, the categorical imperative is a principle common to all reason, not just practical reason. This strikes me as an idea worth exploring in formal epistemology, because it could be used to do quite a lot of useful work.

6.1. Justifying a modified version of expected utility maximization

Expected utility maximization has some problems. How do you handle prospects of infinite utility, as seen in e.g. the St. Petersburg game or Pascal’s Wager? Whereas Cyberkant can be thought of as using expected utility maximization to ground the categorical imperative, we can also explore doing it the other way round: Why maximize expected utility? Because it’s the best long-run policy. It’s also the best policy for groups of people to adopt. In general, expected utility maximization does great when applied over many independent choices… the many times in which high-risk, high-reward gambles fail are outweighed by the few times they succeed, and if you try sufficiently many times, the probability that some will succeed is high. This might also solve the problems arising from St. Petersburg and the Wager: Some gambles are so risky that the probability that at least one of them will succeed is still very low, even if every rational agent always took such gambles. Other gambles are correlated: If we all take Pascal up on his wager, is there a decent chance that at least one of us will win? No.
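To make the “at least one of them will succeed” point concrete, here is a minimal sketch. It assumes the gambles are independent and share a single success probability; the function name and the specific numbers are my own illustrative assumptions, not anything from the literature.

```python
def prob_at_least_one_success(p: float, n: int) -> float:
    """Chance that at least one of n independent gambles succeeds,
    where each gamble succeeds with probability p."""
    return 1.0 - (1.0 - p) ** n


# Illustrative numbers only (assumptions, not taken from the text).
# A risky-but-reasonable gamble, repeated many times, almost surely pays off once.
moderate = prob_at_least_one_success(p=0.01, n=1000)

# An extremely risky gamble stays a long shot even after many tries.
extreme = prob_at_least_one_success(p=1e-12, n=1000)

# Perfectly correlated gambles all succeed or fail together, so repeating
# them does not help: the chance that at least one succeeds is still p.
correlated = 1e-12

print(moderate, extreme, correlated)
```

On this toy picture, the long-run justification covers the first kind of gamble but not the second or third, which is roughly the distinction the paragraph above is drawing.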

6.2. Justifying the Principal Principle and the Bland Indifference Principle:

There are different versions of the Principal Principle; I’m going to talk about this one:

fPCP: If we know what the frequency of A is, and all our background information is stochastically irrelevant, then we rationally should have credence in A equal to the frequency of A.

Strevens has argued that fPCP (and indeed many other versions of the Principal Principle) is impossible to justify. Think about it: The fact that the frequency of coin-tosses landing heads is 0.5 does not entail that this coin-toss will be heads, nor does it entail that it will be tails.

Meanwhile, many (most?) people in formal epistemology accept something like the following principle: (Bostrom 2003)

dBIP: If we know that A is true for a fraction x of all observers of our type, and we have no special information that makes A more or less likely, then we should set our credence in A to x.

If we adopt the Categorical Imperative for theoretical reason, perhaps we can justify something like dBIP and then from there justify fPCP. What follows is a sketch of this reasoning:

If we choose as if we are choosing on behalf of everyone—and if we furthermore choose so as to take into account the epistemic ends of everyone—then it seems we should set our credence in A to x. After all, A is true for fraction x of the total population, so if we are choosing which credence for everyone to have, the credence that maximizes total accuracy will be x. (10) Having thereby justified dBIP or something like it, we can justify a version of fPCP: If we know that the frequency of A happening for observers of our type is x, then… etc. For practical purposes this version of fPCP will be just as good as the real thing, because (arguably) the inductive evidence we get to support scientific claims like “Half of all tritium atoms decay within X years” is equally good evidence for the claim “Half of all tritium atoms that people in situations like ours are wondering about decay within X years.”
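The accuracy claim can be checked numerically. The sketch below assumes accuracy is measured by the Brier score (squared error), as in endnote 10, and that everyone in the population adopts the same credence; the function name, the value x = 0.3, and the grid search are hypothetical choices of mine for illustration.

```python
def average_brier_penalty(credence: float, x: float) -> float:
    """Average squared error across a population in which A is true for a
    fraction x of observers and everyone holds the same credence in A."""
    penalty_if_true = (1.0 - credence) ** 2   # observers for whom A is true
    penalty_if_false = credence ** 2          # observers for whom A is false
    return x * penalty_if_true + (1.0 - x) * penalty_if_false


x = 0.3  # illustrative fraction, an assumption for this example
candidates = [i / 100 for i in range(101)]  # credences 0.00, 0.01, ..., 1.00
best = min(candidates, key=lambda c: average_brier_penalty(c, x))

print(best)  # the penalty-minimizing credence on this grid
```

The same result falls out analytically: the average penalty equals (credence - x)^2 + x(1 - x), which is smallest when the credence equals x.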

7. What Kant can learn from Cyberkant:

Possibly nothing. But here are two suggestions. First, a major difficulty in interpreting Kant (as with any great philosopher) is finding a way to reconstruct their arguments as valid and convincing. On the face of it Kant seems to be saying that morality can be deduced from rationality; this is a very implausible claim that motivates lots of exegesis on what he means by rationality and morality. Similarly, Kant seems to think that all three formulations of the Categorical Imperative are equivalent, which is also a very surprising claim. (436) I submit that we should consider interpreting Kant as Cyberkant, and see how far that gets us—it gets us a good deal of the way towards vindicating both of those surprising claims, for example.

Secondly, even if we think (as seems probable) that Kant was very different, the structural similarities can be used as a source of inspiration to solve problems for Kant. In particular, a common objection to Kant is that his theory collapses unless we have some way to rule out self-serving maxims like “All rational agents do what they think Daniel Kokotajlo wants them to do.” But how can we exclude these maxims in a non-ad-hoc way that also avoids undermining Kant’s justification for obeying the Categorical Imperative in the first place?

Cyberkant has a ready answer to this problem: It’s a fact that if I help other rational agents in need, other rational agents will help me in need. But it’s not a fact that if I help other rational agents named Daniel and ignore the rest, other rational agents will also help other rational agents named Daniel and ignore the rest. Why is the first If-then claim a fact and the other one not? Because that’s how our minds are structured; it’s a fact we know about the world. (11)

7. Conclusion

This paper is, admittedly, rather speculative. However, I think that the similarities highlighted here are intriguing enough to motivate further investigation. Even though Kant and Cyberkant are doubtless very different, they might be able to learn some things from each other. (12)

8. Endnotes

  1. This is a sci-fi novel by Charlie Stross about a slow takeoff distributed technological singularity. The full text is free online at There are different ways to interpret the quote, but my preferred interpretation will soon be clear.
  2. Another version: “Act as if the maxim of your action were to become through your will a universal law.” (421) This formulation is good for my purposes because, as we shall see, there is a sense in which your action does become a universal law, and decision-theoretic reasons to think that you should act accordingly.
  3. I’m not sure the formal epistemology treatment will be able to match the additional meaning of end-in-itself that Kant seems to have, the meaning by which there are certain things you can’t do to somebody even if they & everyone else in the entire world wishes it to be done. That said, it’s not out of the question to interpret Kant as primarily concerned with the adopt-their-ends-as-your own version of this imperative, and thinking of everything else as derivative. This is the interpretation Geoff Sayre-McCord teaches his undergrads, for example, and I find textual support for it in 429 where Kant explains that each rational creature thinks of his own existence as an end–perhaps this is why everyone’s existence should be treated as an end.
  4. Arguably evidential decision theory–the second-most-popular decision theory–thinks this way, though perhaps only as a special case of something even more general. Drescher, Soares & Fallenstein, and Oesterheld think this way. Doug Hofstadter and Chris Meacham seem attracted to it too. For Hofstadter, see For the others, see the References.
  5. This is where the “Cyber” in the title comes from.
  6. Agents could reason using backwards induction: The most powerful won’t cooperate with anyone less powerful, and so the second-most powerful won’t cooperate with anyone less-powerful either, and so… everything falls apart.
  7. For example, say there are 100 levels and the people on the top and bottom know it, but everyone else is clueless about their position. Each of the 98 people in the middle will reason: If I am nice to people less powerful than me, then all the people (except for the guy on top) will be nice to me, and so (unless I’m near the top, which is unlikely) I’ll benefit.
  8. The game theory literature is full of reasons why being able to self-bind is useful. The decision theory literature has talked about this, too. (Meacham 2010)
  9. Saves having to pay those binding costs.
  10. This is true for standard measures of accuracy like the Brier score.
  11. Originally I thought that Kant could straightforwardly solve the problem by engaging in a regress: In order to choose “Everyone does what Daniel wants” as my maxim, I’d have to first set myself the goal of finding a self-serving maxim, and that would not be universalizable. However, it’s not strictly speaking true that I’d have to do that. We can imagine a possible world in which I choose “Everyone does what Daniel wants” as my maxim not via malice but out of some other motive or perhaps no motive at all, and in that world Kant would need to say that my action satisfied the Categorical Imperative.
  12. Many thanks to Markus Kohl, Karl Adam, and Krasimira Filcheva for helpful discussion.

9. References

Bostrom, N. (2003) “Are you living in a computer simulation?” Philosophical Quarterly, Vol. 53, No. 211, pp. 243-255.

Drescher, G. (2006) Good and Real. MIT Press.

Kant, I. (1781) Critique of Pure Reason. Guyer and Wood translation. Cambridge University Press, 1998.

Kant, I. (1785) Grounding for the Metaphysics of Morals. J. Ellington translation. Hackett Publishing Company, 1993.

Meacham, C. (2010) “Binding and its Consequences” Philosophical Studies, 149(1): 49–71.

Oesterheld, C. (2017) “Multiverse-wide Cooperation via Correlated Decision Making.” Unpublished. Available online at

Soares, N. and Fallenstein, B. (2015) “Toward Idealized Decision Theory” arXiv: 1507.01986, 2015.

Strevens, M. (1999) “Objective probability as a guide to the world.” Philosophical Studies 95: 243-275.

Tomasik, B. (2015) “Interpreting the Categorical Imperative.” Blog post. Available online at:

Williams, G. (2018) “Kant’s Account of Reason”, The Stanford Encyclopedia of Philosophy (Summer 2018 Edition), Edward N. Zalta (ed.), forthcoming URL = <>.

Five minutes on whether the industrial revolution was a high-leverage time

In this excellent post, Ben Garfinkel asks:

An analogy is sometimes made to the industrial revolution and the agricultural revolution. The idea is that in the future, impacts of AI may be substantial enough that there will be changes that are comparable to these two revolutionary periods throughout history.
The issue here, though, is that it’s not really clear that either of these periods actually were periods of especially high leverage. If you were, say, an Englishman in 1780, and trying to figure out how to make this industry thing go well in a way that would have a lasting and foreseeable impact on the world today, it’s really not clear you could have done all that much. 

I figured it would be worth spending five minutes thinking about whether it would be reasonable for a smart, wealthy, effective-altruist Englishman in 1780 to focus on trying to steer the long-term future via influencing the industrial revolution.

  1. It should have been clear to our hypothetical Englishman that the industrial revolution would lead to an increase in economic power that would result in an increase in military power. Thus, he might have been able to predict that the colonialism sweeping the globe would intensify thanks to the IR. He could influence this by bringing the IR to other countries directly, rather than waiting for the British Empire to get even bigger and more powerful. (For example: Bringing steam engine technology to China or Japan would probably work decently well at preventing them from getting colonized.) If for whatever reason he decided that British colonization was a good thing, he could have sped it up by sabotaging the IR as it spread to other countries (France, etc.).
  2. On the subject of speeding up and slowing down, he could have done research on the likely effects of such increased worker productivity on society–would it lead to a higher or lower standard of living? He could then have agitated with the British government to accelerate or slow down the IR. (For example, he might reasonably conclude that the IR would render slave plantations obsolete, and thus make abolishing slavery easier. Or he might conclude instead that factory labor is particularly suitable to slavery. I don’t myself know which is right, but that’s because I don’t know history that well; living through it perhaps our Englishman could have made some decent guesses at least.)
  3. Miscellaneous: He could have tried to influence whether the IR is publicly associated with conservatism or progressivism or monarchism or whatever, or whether it is seen as politically neutral. He could have tried to anticipate the problems it would cause (pollution, urbanization, plagues) and begin working on solutions early.

OK, out of time & ideas. (That was more like 10 minutes). What, if anything, to conclude?

I think it’s plausible that a smart, wealthy Englishman in 1780 would have had a decent amount of leverage on the future via the IR–but also plausible that he would have had even more leverage via sociopolitical things like advocating for democracy, or slavery abolition, or decolonialism, or whatnot. And it’s also plausible that he would have had even more leverage by advancing the scientific method, especially in medicine.

I’m not sure what to make of this, if anything. I think there is a big disanalogy between AI safety stuff and the IR, namely, that there seems to be a real existential risk from AI takeover, whereas there was no such thing for the IR. For the IR, the biggest points of leverage were over how fast it happened and over where it happened. For AI, those points of leverage exist also, but there is also a much bigger and more important lever having to do with whether or not we all die.

I think I would very seriously consider abandoning my focus on AI if I were convinced that AI wasn’t an existential threat. If, for example, the control problem were solved, so that I was confident AI would be merely like a second IR, then I would maybe shift to more sociopolitical activism, trying to change social structures to better prepare for AI (or to distribute it more equitably).

Mozi, Mengzi, and Effective Altruism

[Author’s note: Another old paper makes it online!]

A while back I took a class on ancient Chinese philosophy, taught by David Wong (who is great, by the way). One of the things that Professor Wong noted was the similarity between modern-day ethical debates surrounding effective altruism and the ancient dispute between Mozi and Mengzi. I dug a little deeper and made a list of all the similarities I could think of; this is that list. Comments welcome:

I’m sufficiently impressed by these similarities that I think it is fair to say that Mozi and Mengzi really were talking about the same issues that feature prominently in debates about EA today; they were even making many of the same arguments and taking many of the same positions. I think this is really cool. If I ever teach intro to philosophy (or ancient philosophy, or non-western philosophy, or intro to ethics) in university, I intend to include a section on Mozi/Mengzi/EA.

An analogy for the simulation argument

To people already familiar with the argument, this may be old news. It’s an analogy to help people feel the intuitive force of the argument.

Suppose the following: New diseases appear inside people all the time. Only one in ten diseases is contagious, but the contagious diseases spread to a thousand other people, on average, before dying out. The non-contagious ones don’t spread to anyone else. You get sick. You haven’t yet gone to the doctor or googled your symptoms; given the above facts, part of you wonders whether you should bother… after all, 90% of diseases only ever infect one person. So, how likely is it that someone else has had your disease before?

Draw a Venn diagram with two circles: “First victim” and “contagious.” This makes three interesting categories:

A = First victims of noncontagious diseases

B = First victims of contagious diseases

C = Nonfirst victims of contagious diseases

Since only one in ten diseases is contagious, B/A = 1/9. Since contagious diseases spread to a thousand people on average, B/C = 1/1000. It follows that C/(A+B+C) = 100/101.

Since you know you are diseased, you should assign credence to each hypothesis “I’m in A” “I’m in B” “I’m in C” equal to the fraction of diseased people who are in that category, at least until you get evidence that you would be more likely to get in one category than another.

(For example, if you google your symptoms and get zero hits, that’s evidence that is more likely to happen to people in group A or B, so you should update to be less confident that you are in group C than the base rate would suggest.)

So you are almost certain to be in category C, since they make up 100/101 of the total population of diseased people, and at the moment you don’t have any evidence that is less likely to happen to people in category C than people in the other categories. So it’s almost certain that other people have had your disease already, even though 90% of diseases only ever infect one person.
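For anyone who wants to check the arithmetic, here is a small sketch that recomputes the three shares with exact fractions. The function name and the choice to count victims per disease are my own assumptions for illustration; the inputs (one in ten contagious, a thousand further victims each) are the ones stipulated above.

```python
from fractions import Fraction


def victim_shares(contagious_fraction: Fraction, spread_per_contagious: int) -> dict:
    """Share of all diseased people in each category, counted per disease."""
    a = 1 - contagious_fraction                        # A: first victims, noncontagious
    b = contagious_fraction                            # B: first victims, contagious
    c = contagious_fraction * spread_per_contagious    # C: later victims, contagious
    total = a + b + c
    return {"A": a / total, "B": b / total, "C": c / total}


shares = victim_shares(Fraction(1, 10), 1000)
for category, share in shares.items():
    print(category, share)
```

With the stipulated numbers, category C's share comes out to 100/101, matching the figure in the text.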

In the simulation argument, the reasoning is similar.

Suppose that 90% of non-simulated civilizations never create ancestor sims, but that the remaining 10% create 1000 each on average. Then most civilizations are ancestor-simulated.

Do we have any evidence that is more likely to happen to people who are ancestor-simulated than people who are not, or vice versa?

You might think the answer is yes: For example, simulated people are more likely to see things that appear to break the laws of physics, more likely to see pop-up windows saying “you are in a simulation,” etc. 

But Bostrom anticipates this when he restricts his argument to ancestor simulations. Ancestor simulations are designed to perfectly mimic real history, so the answer is no: Someone in category C is just as likely to see what we see as someone in category B or A.

So even if we rule out non-ancestor-simulations entirely, the argument goes through: Most civilizations with evidence like ours are ancestor-simulated, and we don’t have any reason to think we are special, so we are probably ancestor-simulated. And of course, we shouldn’t rule out non-ancestor-simulations entirely, so the probability that we are simulated should be even higher than the probability we are ancestor-simulated.

(Clarificatory caveat: Bostrom’s argument is NOT for the conclusion that we are in a simulation; rather, it is a triple-disjunct. In the framing discussed here, Bostrom leaves open the possibility that Group A is extremely large relative to Group B, large enough that it is bigger than B and C combined.)
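The caveat can be illustrated by varying how many non-simulated civilizations create ancestor sims. This is only a sketch: the function name and the list of alternative fractions are hypothetical, and the figure of 1000 sims per creating civilization is the one supposed above.

```python
from fractions import Fraction


def simulated_share(creating_fraction: Fraction, sims_per_creator: int) -> Fraction:
    """Fraction of all civilizations that are ancestor-simulated, assuming a
    share creating_fraction of non-simulated civilizations each run
    sims_per_creator ancestor sims and the rest run none."""
    simulated_per_real = creating_fraction * sims_per_creator
    return simulated_per_real / (1 + simulated_per_real)


# The first entry is the case from the text; the others are illustrative assumptions.
for creating_fraction in [Fraction(1, 10), Fraction(1, 1000), Fraction(1, 10**6)]:
    share = simulated_share(creating_fraction, 1000)
    print(creating_fraction, share, float(share))
```

When the creating fraction is small enough that fewer than one sim is run per non-simulated civilization on average, simulated civilizations are the minority, which is the possibility the triple-disjunct leaves open.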

Great Map of the Mind

We have all these theories and debates about parts of the mind; why not make a big map to show how they all fit together? 

Obviously, minds aren’t all the same. But having a map like this helps us organize our thoughts, to better understand our own minds and the minds we are trying to design and reason about. I’d love to see better versions, or elaborations of this one, or entirely different mind-designs.

If you want the original document: Get, then follow this link and click “Open with…” and then select I’d love it if people spin off improved versions.