Short summary and overview
The SPI framework tells us that if we choose only between, for instance, aligned delegation and aligned delegation plus surrogate goal, then implementing the surrogate goal is better. This argument in the paper is persuasive pretty much regardless of what kind of beliefs we hold and what notion of rationality we adopt. In particular, it should convince us if we’re expected utility maximizers. However, in general we have more than just these two options (i.e., more than just aligned delegation and aligned delegation plus surrogate goals); we can instruct our delegates in all sorts of ways. The SPI formalism does not directly provide an argument that among all these possible instructions we should implement some instructions that involve surrogate goals. I will call this the surrogate goal justification gap. Can this gap be bridged? If so, what are the necessary and sufficient conditions for bridging the gap?
The problem is related to but distinct from other issues with SPIs (such as the SPI selection problem, or the question of why we need safe Pareto improvements as opposed to “safe my-utility improvements”).
Besides describing the problem, I’ll outline four different approaches to bridging the surrogate goal justification gap, some of which are at least implicit in prior discussions of surrogate goals and SPIs:
- The use of SPIs on the default can be justified by pessimistic beliefs about non-SPIs (i.e., about anything that is not an SPI on the default).
- As noted above, we can make a persuasive case for surrogate goals when we face a binary decision between implementing surrogate goals and aligned delegation. Therefore, the use of surrogate goals can be justified if we decompose our overall decision of how to instruct the agents into this binary decision and some set of other decisions, and we then consider these different decision problems separately.
- SPIs may be particularly attractive because it is common knowledge that they are (Pareto) improvements. The players may disagree or may have different information about whether any given non-SPI is a (Pareto) improvement or not.
- SPIs may be Schelling points (a.k.a. focal points).
I don’t develop any of these approaches to justification in much detail.
Beware! Some parts of this post are relatively hard to read, in part because it is about questions and ideas that I don’t have a proper framework for, yet.
A brief recap of safe Pareto improvements
I’ll give a brief recap (with some reconceptualization, relative to the original SPI paper) of safe Pareto improvements here. I doubt that this post will be comprehensible (or interesting) without any prior reading on the subject. Probably the most accessible introductions to the concepts are currently: the introduction of my paper on safe Pareto improvements with Vince Conitzer, and Sections 1 and 2 of Tobias Baumann’s blog post, “Using surrogate goals to deflect threats”.
Imagine that you and Alice play a game G (e.g., a normal-form game) against each other, except that you only play this game indirectly: you both design artificial agents or instruct some human entities that play the game for you. We will call these the agents. (In the SPI paper, these were called the “representatives”.) We will call you and Alice the principals. (In the SPI paper, these were called the “original players”.) We will sometimes consider cases in which the principals self-modify (e.g., change their utility functions), in which case the agents are the same as (or the “successor agents” of) the principals.
We imagine that the agents have great intrinsic competence when it comes to game playing. A natural (default) way to indirectly play G is therefore for you and Alice to both instruct your agents by giving them your utility function (and perhaps any information about the situation that the agents don’t already have), and then telling them to do the best they can. For instance, in the case of human agents, you might set up a simple incentive contract and have your agent be paid in proportion to how much you like the outcome of the interaction.
Now imagine that you and Alice can (jointly or individually) intervene in some way on how the agents play the game, e.g., by giving them alternative utility functions (as in the case of surrogate goals) or by barring them from playing this or that action. Can you and Alice improve on the default?
One reason why this is tricky is that both with and without the intervention, it may be very difficult to assign an expected utility to the interaction between the agents due to equilibrium selection problems. For example, imagine that the base game G is some asymmetric version of the Game of Chicken and imagine that the intervention results in the agents playing some different asymmetric Game of Chicken. In general, it seems very difficult to tell which of the two will (in expectation) result in a better outcome for you (and for Alice).
Roughly, a safe Pareto improvement on the base game G is an intervention on the interaction between the agents (e.g., committing not to take certain actions, giving them alternate objectives, etc.) that you can conclude is good for both players, without necessarily resolving hard questions about equilibrium selection. Generally, safe Pareto improvements rest on relatively weak, qualitative assumptions about how the agents play different games.
Here’s an example of how we might conclude something is Pareto-better than the default. Imagine we can intervene to make the agents play a normal-form game G’ instead of G. Now imagine that G’ is isomorphic to G, and that each outcome of G’ is Pareto-better than its corresponding outcome in G. (E.g., you might imagine that G’ is just the same as G, except that each outcome gives +1 utility to both players. See the aforementioned articles for more interesting examples.) Then it seems plausible that we can conclude that this intervention is good for both players, even if we don’t know how the agents play G or G’. After all, there seems to be no reason to expect that equilibrium selection will be more favorable (to one of the players) in one game relative to the other.
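To make the shape of this argument concrete, here is a minimal sketch (in Python, with hypothetical payoff numbers) of the kind of check it relies on: verifying that each outcome of G’ is at least as good for both players as the corresponding outcome of G, without modeling how either game is actually played.

```python
# Minimal sketch with hypothetical payoffs: check that an intervention G_prime
# is outcome-wise Pareto-better than G under a fixed correspondence between
# outcomes. Note that this check says nothing about equilibrium selection.

# Each game maps an outcome label to (your utility, Alice's utility).
G = {"o1": (0, 1), "o2": (1, 0), "o3": (-3, -3), "o4": (1, 0)}
# G_prime: isomorphic to G, but with +1 utility for both players in every outcome.
G_prime = {o: (u + 1, v + 1) for o, (u, v) in G.items()}

def outcomewise_pareto_better(g_new, g_old):
    """True iff every outcome of g_new is at least as good for both players as
    the corresponding outcome of g_old, and strictly better somewhere."""
    weakly_better = all(
        g_new[o][0] >= g_old[o][0] and g_new[o][1] >= g_old[o][1] for o in g_old
    )
    strictly_better_somewhere = any(
        g_new[o][0] > g_old[o][0] or g_new[o][1] > g_old[o][1] for o in g_old
    )
    return weakly_better and strictly_better_somewhere

print(outcomewise_pareto_better(G_prime, G))  # True
```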
Introducing the problem: the SPI justification gap
I’ll now give a sketch of a problem in justifying the use of SPIs. Imagine (as in the previous “recap” section) that you and Alice play a game against each other via your respective agents. You get to decide what game the agents play. Let’s say you have four options G0, G1, G2, G3 for games that you can let the agents play. Let’s say G0 is some sort of default and that G1 is a safe Pareto improvement on G0. Imagine further that there are no other SPI relations between any pair of these games. In particular, G2 and G3 aren’t safe Pareto improvements on G0, and G1 isn’t an SPI on either G2 or G3.
For a concrete example, let’s imagine that G0 is some sort of threat game with aligned delegation (i.e., where the agents have the same objectives as you and Alice, respectively).
| | Threaten | Not threaten |
| --- | --- | --- |
| Give in | 0,1 / 0,1 | 1,0 / 1,0 |
| Not give in | -3,-3 / -3,-3 | 1,0 / 1,0 |
In each cell, the first pair of numbers gives the agents’ utilities and the second pair gives your and Alice’s utilities. (Since the agents are aligned with you and Alice, the two pairs coincide in every cell.)
(You might object that, while this game has multiple equilibria, giving in weakly dominates not giving in for the row player, and that we should therefore strongly expect this game to result in the (Give in, Threaten) outcome. However, for this post, we have to imagine that the outcome of this game is unclear. One way to achieve ambiguity would be to imagine that (Not give in, Not threaten) is better for the row player than (Give in, Not threaten), e.g., because preparing to transfer the resource takes some effort (even if the resource is never in fact demanded and thus never transferred), or because there’s some chance that third parties will observe whether the row player is resolved to resist threats and will be more likely to threaten the row player in the future if the row player is not resolved.)
Then let’s imagine that G1 is the adoption of a surrogate goal and thus the deflection of any threats to that surrogate goal. Let’s therefore represent G1 as follows:
| | Threaten surrogate | Not threaten |
| --- | --- | --- |
| Give in | 0,1 / 0,1 | 1,0 / 1,0 |
| Not give in | -3,-3 / **1,-3** | 1,0 / 1,0 |
Note that the agents’ utility functions are isomorphic to those in G0. The only difference (highlighted in bold) is that (Not give in, Threaten surrogate) is better for the principals than the corresponding outcome (Not give in, Threaten) in G0: you get 1 instead of -3, while Alice gets -3 either way.
Let’s say G2 simply consists in telling your agent to not worry so much about threats against you being carried out. Let’s represent this as follows:
| | Threaten | Not threaten |
| --- | --- | --- |
| Give in | 0,1 / 0,1 | 1,0 / 1,0 |
| Not give in | -1,-3 / -3,-3 | 1,0 / 1,0 |
Finally, let’s say G3 is a trivial game in which you and Alice always receive a payoff of ⅓ each.
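For concreteness, here is a minimal sketch (in Python) that encodes G0 and G1 from the tables above and checks the two qualitative facts the SPI argument uses: that the agents’ payoffs in G1 are identical to those in G0 up to relabeling Threaten as Threaten surrogate, and that each outcome of G1 is at least as good for both principals as the corresponding outcome of G0. (Whether these facts suffice for an SPI of course depends on further assumptions, such as isomorphic games being played isomorphically.)

```python
# Sketch: each game maps an outcome (row action, column action) to a pair
# (agents' payoffs, principals' payoffs), using the numbers from the tables above.

G0 = {  # aligned delegation
    ("give_in",     "threaten"):     ((0, 1),   (0, 1)),
    ("give_in",     "not_threaten"): ((1, 0),   (1, 0)),
    ("not_give_in", "threaten"):     ((-3, -3), (-3, -3)),
    ("not_give_in", "not_threaten"): ((1, 0),   (1, 0)),
}

G1 = {  # surrogate goal: threats are deflected onto the surrogate goal
    ("give_in",     "threaten_surrogate"): ((0, 1),   (0, 1)),
    ("give_in",     "not_threaten"):       ((1, 0),   (1, 0)),
    ("not_give_in", "threaten_surrogate"): ((-3, -3), (1, -3)),
    ("not_give_in", "not_threaten"):       ((1, 0),   (1, 0)),
}

# The correspondence between outcomes of G0 and G1 (relabeling the column action).
relabel = {"threaten": "threaten_surrogate", "not_threaten": "not_threaten"}
correspondence = {(r, c): (r, relabel[c]) for (r, c) in G0}

# Condition 1: the agents' payoffs are identical up to the relabeling.
agents_isomorphic = all(G0[o][0] == G1[correspondence[o]][0] for o in G0)

# Condition 2: each outcome of G1 is at least as good for both principals.
principals_weakly_better = all(
    G1[correspondence[o]][1][i] >= G0[o][1][i] for o in G0 for i in (0, 1)
)

print(agents_isomorphic, principals_weakly_better)  # True True
```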
Clearly, you should now prefer that your agents play G1 rather than G0. Following the SPI paper, we can offer straightforward (and presumably uncontroversial) justifications of this point. If, furthermore, you’re sufficiently “anchored” on the default G0, then G1 is your unique best option given what I’ve told you. For example, if you want to ensure low worst-case regret relative to the default, you might be compelled to choose G1. (And again, if for some reason you were barred from choosing G2 and G3, then you would know that you should choose G1 rather than G0.) The SPI paper mostly assumes this perspective of being anchored on the default.
But now let’s assume that you approach the choice between G0, G1, G2, and G3 from the perspective of expected utility maximization. You just want to choose whichever of the four games is best. Then the above argument shows that G0 should not be played. But it fails to give a positive account of which of the four games you should choose. In particular, it doesn’t make much of a positive case for G1, other than that G1 should be preferred over G0. From an expected utility perspective, the “default” isn’t necessarily meaningful and there is no reason to play an SPI on the default.
In some sense, you might even say that SPIs are self-defeating: They are justified by appeals to a default, but then they themselves show that the default is a bad choice and therefore irrelevant (in the way that strictly dominated actions are irrelevant).
The problem is that we do want a positive account of, say, surrogate goals. That is, we would like to be able to say in some settings that you should adopt a surrogate goal. The point of surrogate goals is not just a negative claim about aligned delegation. Furthermore, in practice whenever we would want to use safe Pareto improvements or surrogate goals, we typically have lots of options, i.e., our situation is more like the G0 versus … versus G3 case than a case of a binary choice (G0 versus G1). That is, in practice we also have lots of other options that stand in no SPI relation to the default (and for which no SPI on the default is an SPI on them). (For example, we don’t face a binary choice between aligned delegation with and without surrogate goal. Within either of these options we have all the usual parameters – “alignment with whom?”, choice of methods, hyperparameters, and so on. We might also have various third options, such as the adoption of different objectives, intervening on the agent’s bargaining strategy, etc.) (Of course, there might also be multiple SPIs on the default. See Section 6 of the SPI paper and the discussion of the SPI selection problem in the “related issues” section below.)
For the purpose of this post, I’ll refer to this as the SPI justification gap. In the example of surrogate goals, the justification gap is the gap between the following propositions:
A) We should prefer the implementation of surrogate goals over aligned delegation.
B) We should implement surrogate goals.
The SPI framework justifies A very well (under various assumptions), including from an expected utility maximization perspective. In many settings, one would think B is true. But unless we are willing to adopt a “default-relative” decision making framework (like minimizing worst-case regret relative to the default), B doesn’t follow from A and the SPI framework doesn’t justify B. So we don’t actually have a compelling formal argument for adopting surrogate goals and in particular we don’t know what assumptions (if any) are needed for surrogate goals to be fully compelling (assuming that surrogate goals are compelling at all).
Now, one immediate objection might be: Why do we even care about whether an SPI is chosen? If there’s something even better than an SPI (in expectation), isn’t that great? And shouldn’t we be on the lookout for interventions on the agents’ interaction that are even better than SPIs? The problem with openness to non-SPIs is basically the problem of equilibrium selection. If we narrow our decision down to SPIs, then we only have to choose between SPIs. This remaining problem might be well-behaved in various ways. For instance, in some settings, the SPI is unique. In other settings, even the best-for-Alice SPI (among SPIs that cannot be further safely Pareto-improved upon) might be very good for you. Cf. the discussion of the SPI selection problem in Section 6 of the original SPI paper. Meanwhile, if we don’t narrow our interventions on G down to SPIs on G, then in many cases the choice between interventions on G is effectively just choosing a strategy for G, which may be difficult. In particular, if we consider all interventions on G, then we may also need to consider interventions that benefit us only very slightly. For example, in the case of games with risks of conflict (like the Demand Game in the original SPI paper), all SPIs that cannot be further safely improved upon eliminate conflict outcomes (and don’t worsen the interaction in other ways). In contrast, non-safe Pareto improvements (i.e., interventions that are good for both players in expectation relative to some probabilistic beliefs about game play) may also entail a higher probability of conceding more (in exchange for a decreased risk of conflict).
Below, I will first relate the justification gap problem to some other issues in the SPI framework. I will then discuss a few approaches to addressing the SPI justification gap. Generally, I think compelling, practical arguments can be made to bridge the SPI justification gap. But it would be great if more work could be done to provide more rigorous arguments and conditions under which these arguments do and don’t apply.
I should note that in many real-world cases, decisions are not made by expected utility maximization (or anything very close to it) anyway. So, even if the justification gap cannot be bridged at all, SPIs could remain relevant in many cases. For instance, in many cases interventions have to be approved unanimously (or at least approximately unanimously) by multiple parties with different values and different views about equilibrium selection. If no intervention is approved, then the default game is played. In such cases, SPIs (or approximate SPIs) on the default may be the only interventions that have a good chance at universal approval. For instance, an international body might be most likely to adopt laws that are beneficial to everyone while not changing the strategic dynamics much (say, rules pertaining to the humane treatment of prisoners of war). (In comparison, they might be less likely to adopt improvements that change the strategic dynamics – say, bans of nuclear weapons – because there will likely be someone who is worried that they might be harmed by such a law.) Another real-world example is that of “grandfathering in” past policies. For instance, imagine that an employer changes some of its policies for paying travel expenses (e.g., from reimbursing item by item to paying a daily allowance in order to decrease administrative costs). Then the employer might worry that some of its employees will prefer the old policy. In practice, the employer may wish to not upset anyone and thus seek a Pareto improvement. In principle, it could try to negotiate with existing employees to find some arrangement that leaves them happy while still enabling the implementation of the new policy. (For instance, it could raise pay to compensate for the less favorable reimbursement policy.) But in many contexts, it may be easier to simply allow existing employees to stay on the old policy (“grandfathering in” the old policy), thus guaranteeing to some extent that every employee’s situation will improve.
Some related issues
For clarification, I’ll now discuss a few other issues and clarify how they relate to the SPI justification gap, as discussed in this post.
First, the SPI justification gap is part of a more general question: “Why SPIs/surrogate goals?” As described above, earlier work gives, in my view, a satisfactory answer to a different aspect of the “Why SPIs?” question. Specifically, SPIs are persuasive in case of binary comparison: if we have a choice between two options G and G’ and G’ is an SPI on G, then we have good enough reason to choose G’ over G. This post is specifically about an issue that arises when we move beyond binary comparisons. (More generally, it’s about issues that arise if we move beyond settings where one of the available options is an SPI on all the other available options.)
Second, the justification gap is a separate problem from the SPI selection problem (as discussed in Section 6 of the SPI paper). The SPI justification gap arises even if there is just one SPI (as in the above G0, …, G3 example). Conversely, if we can in some way bridge the SPI justification gap (i.e., if we can justify restricting attention to interventions that are SPIs), then the SPI selection problem remains. The problems also seem structurally different. The justification gap isn’t intrinsically about reaching multiplayer agreement in the way that SPI selection is. That said, some of the discussion below touches on SPI selection (and equilibrium selection) here and there. Conversely, we might expect that approaches to SPI selection would also relate to SPI justification. For instance, some criteria for choosing between SPIs could also serve as criteria that favor SPIs over non-SPIs. For example, maximizing the minimum utility improvement both induces (partial) preferences over SPIs and preferences for SPIs over non-SPIs.
Third, the safe Pareto improvement framework assumes that when deciding whether to replace the default with something else, we restrict attention to games that are better for both players. (This is a bit similar to imposing a “voluntary participation” or “individual rationality” constraint in mechanism design.) In the context of surrogate goals this is motivated by worries about retaliation against aggressive commitments. From the perspective of an expected utility maximizer, it’s unclear where this constraint comes from. Why not propose some G that is guaranteed to be better for you, but not guaranteed to be better for the other player than the default? This is a different question from the question we ask in this post. Put concisely, this post is about justifying the “safe” part, not about justifying the “Pareto” part. (At least in principle, we could consider “safe my-utility improvements” (which are explicitly allowed to be bad for the opponent relative to the default) and ask when we are justified in specifically pursuing these over other interventions. I’m here considering safe Pareto improvements throughout, because the concept of safe Pareto improvement is more widely known and discussed.) Nevertheless, I think the problem of justifying the Pareto constraint – the constraint that the other player is (“safely” or otherwise) not made worse off than the default – is closely related to the SPI justification gap. Both are about a “safety” or “conservativeness” constraint that isn’t straightforwardly justifiable on expected-utility grounds. See also “Solution idea 3” below, which gives a justification of safety that is closely related to the Pareto requirement (and justifications thereof).
Solution idea 1: Pessimistic beliefs about one’s ability to choose non-SPIs
Consider again the choice between G0, …, G3, where G1 is an SPI on G0 and there are no other SPI relations. Probably the most straightforward justification for SPIs would be a set of beliefs under which the expected utility of G0 is higher than the expected utility of either of the games G2 and G3. Specifically, you could have a belief like the following, where G0 is the default: if I can’t prove from some specific set of assumptions (such as “isomorphic games are played isomorphically”, see the SPI paper) that u(G) > u(G0), then I should believe E[u(G)] < E[u(G0)]. (I might hold this either as a high-level belief, or it might be that all my object-level beliefs satisfy this.) Of course, in some sense, this is just a belief version of decision principles like minimizing regret relative to the default. Michael Dennis et al. make this connection explicit in the appendix of “Emergent Complexity and Zero-shot Transfer via Unsupervised Environment Design” as well as in some ongoing unpublished work on self-referential claims.
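Here is a rough sketch (in Python, with made-up utility estimates) of what such non-SPI pessimism amounts to as a decision rule; the set of provable improvements on the default is simply taken as given rather than derived from explicit assumptions.

```python
# Rough sketch of non-SPI pessimism as a decision rule. Utility estimates are
# made up; the set of provable improvements on the default is taken as given.

default = "G0"
options = ["G0", "G1", "G2", "G3"]
provable_improvements = {"G1"}  # e.g., G1 is an SPI on G0
naive_eu = {"G0": 0.4, "G1": 0.6, "G2": 0.9, "G3": 1 / 3}  # naive estimates

def pessimistic_eu(g):
    if g == default or g in provable_improvements:
        return naive_eu[g]
    # Pessimism: anything not provably better than the default is assumed to be
    # (slightly) worse than the default, whatever its naive estimate says.
    return naive_eu[default] - 0.01

print(max(options, key=pessimistic_eu))  # "G1"
# Note that this rule also downgrades the sure payoff of G3, which is exactly
# what the objection in the next paragraph targets.
```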
Intuitively, non-SPI pessimism is based on trust in the abilities of the agents that we’re delegating to. For instance, in the case of threats and surrogate goals, I might be pessimistic about telling my agent that the opponent is likely bluffing, because I might expect that my agent will have more accurate beliefs than me about whether the opponent is bluffing. (On the other hand, if I can tell my agent that my opponent is bluffing in a way that is observable to the opponent, then such announcements are a sort of credible commitment. And if I can commit in this way and my agent cannot commit, then it might very well be that I should commit, even if doing so in some sense interferes with my agent’s competence.) One might be pessimistic about G2 for similar reasons.
I don’t think such pessimistic beliefs are plausible in general, however. One argument against “non-SPI pessimism” is that one can give “almost SPIs” about which I would be optimistic, and about which we need to be optimistic if we wish to apply SPIs in the real world (where exact SPIs are hard to construct). For example, take the Demand Game example from the SPI paper and decrease Player 1’s payoff in the outcome (DL,RL) by 0.00001. You can’t make a clean SPI argument anymore, but I’d still in practice be happy with the approach proposed by the SPI paper. Or take the above example (the choice between G0, G1, G2, G3) and imagine that, as a better version of G3, you can also choose a guaranteed payoff of 0.99. (If you want the Pareto part, you may imagine that Alice also receives 0.99 in this case.) Rejecting this option seems to imply great optimism about G0: you need to believe that Alice’s agent will play Not threaten with probability >99%. Such optimism seems uncompelling. (We could construct alternative games G0 in which such optimism seems even less compelling.) Maybe the example is even more powerful if we take the perspective of a third party that wants to improve both players’ utility (and imagine again that a certain payoff of 0.99 for both players is available). After all, if G0 is better than 0.99 for one player, then 0.99 is much better than G0 for the other player.
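To spell out the arithmetic behind the “>99%” claim, here is a quick check using your payoffs from the G0 table (the 0.995 value is just an illustrative belief):

```python
# Quick check of the ">99%" claim: your payoff in G0 is 1 if Alice's agent does
# not threaten and at most 0 if it does (0 if you give in, -3 if you don't).
def best_case_eu_G0(p_not_threaten):
    return p_not_threaten * 1 + (1 - p_not_threaten) * 0

print(best_case_eu_G0(0.99))   # 0.99 -- only just matches the guaranteed payoff
print(best_case_eu_G0(0.995))  # 0.995 -- beats 0.99 only because p > 99%
```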
Solution idea 2: Decision factorization
(I suspect this section is harder to read than the others. There seem to be more complications to these ideas than those in the other sections. I believe the approach in this section to be the most important, however.)
One intuitive justification for the application of SPIs is that the decision of whether to use the SPI/surrogate goal can be made separately from other decisions. This is quite explicit in the original surrogate goal story: commit to the surrogate goal now – you can still think about, say, whether you should commit to not give in later on. If right now you only decide whether to use a surrogate goal or not, then right now you face a binary decision where one option is an SPI on the other, which is conceptually unproblematic. If instead you decide between all possible ways of instructing your agents at once, you have lots of options and the SPI concept induces only a partial order on these options. In some cases a decision may be exogenously factorized appropriately, e.g., you might be forced to decide only whether to use surrogate goals on day 1, and to make other decisions on day 2 (or even have someone else make other decisions on day 2). The challenge is that in most real-world cases, decisions aren’t externally factorized in this way.
Here’s a toy model, inspired by surrogate goals. Let’s say that you play a game in which you choose two bits, so overall you have four actions. However, imagine that you choose the two bits in sequence. Perhaps when thinking about the first bit, you don’t yet know what second bit you will choose. Let’s say that each sequence of bits in turn gives rise to some sort of game – call these games G00, G01, G10, and G11, where the first bit encodes the first choice and the second encodes the second. Intuitively you might imagine that the first choice is the choice of whether to (immediately) adopt a surrogate goal, and the second choice is whether to make some commitment to resist unfair outcomes. Importantly, you need to imagine that if you adopt the surrogate goal, then your choice on the second day will be made on the basis of your new utility function, the one that includes the surrogate goal. Anyway, you don’t need to have such a specific picture in mind.
Now, let’s say that you find that G10 is an SPI on G00 and that G11 is an SPI on G01. Imagine further that the choice of the second bit is isomorphic between the case where the first bit chosen was 1 and the case where the first bit chosen was 0; that is, in G1 (the state of affairs where the first bit was chosen to be 1 and the second bit is not yet chosen) you will choose 0 if and only if you choose 0 in G0 (defined analogously). Then overall, you can conclude that G1 is an SPI on G0. Since in the first time step you only choose between 0 and 1, you can conclude by SPI logic from the above that you should choose 1 at the first time step. (Of course, the decision in G1 will still be difficult.)
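Here is a toy sketch (in Python; the SPI relations and the second-bit policy are hypothetical stand-ins) of why these two assumptions let you conclude that choosing 1 for the first bit is safe, whatever you would later choose for the second bit:

```python
# Toy sketch of the factorization argument. The SPI relations and the
# second-bit policy below are hypothetical stand-ins.

spi_pairs = {("G10", "G00"), ("G11", "G01")}  # G10 is an SPI on G00, etc.

def second_bit_policy(context):
    # Whatever this returns, it does not depend on the first bit
    # (the isomorphism assumption in the text).
    return 0 if context == "cautious" else 1

def composed_game(first_bit, context):
    return f"G{first_bit}{second_bit_policy(context)}"

# For every context, choosing 1 for the first bit yields a game that is an SPI
# on the game you would have gotten by choosing 0 for the first bit.
print(all(
    (composed_game(1, ctx), composed_game(0, ctx)) in spi_pairs
    for ctx in ["cautious", "aggressive"]
))  # True
```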
Here’s a non-temporal informal example of how factorization might work. Imagine that AliceCorp is building an AI system. One of the divisions of AliceCorp is tasked with designing the bargaining behavior of the AI system (e.g., by training the AI system on some set of bargaining scenarios). A different “SPI division” is tasked with assessing whether to modify the system’s utility function (e.g., by modifying the training rewards) to include surrogate goals. This SPI division is explicitly not supposed to induce a utility function that causes the system to give in less – influencing the bargaining policy is the job of the bargaining division after all. Importantly, the bargaining division either doesn’t know about the possibility of surrogate goals or doesn’t care about it. (For example, imagine that the bargaining division is paid in proportion to how well the AI system ends up handling bargaining situations in the real world. Then it is important that if a surrogate goal is implemented, the bargaining division receives low incentive pay if a surrogate threat is carried out.) If AliceCorp happens to be set up in this way, then – ignoring SPI selection, loads of implementation details, etc. – the SPI division faces a binary choice between implementing the SPI and not implementing the SPI.
Again, the main idea here is that we factorize our decision into multiple decisions and then we consider these decisions separately. In particular, the decision of whether to adopt the SPI or not becomes an easy binary decision.
Note that the factorization argument requires some structure on the set of possible games that might be played. The structure is what allows you to make the decision about the SPI in isolation, without considering your other choices. (For instance, adopting a surrogate goal doesn’t stop you from committing to never give in later on.) The argument doesn’t work if you just choose between four arbitrary games (as in the G0 versus … G3 case), one of which is an SPI on one of the other ones.
Also, note that depending on the details of the setting, the separation of the different decisions seems to be necessary to make the story compelling. For example, in the surrogate goal story, it’s important to first adopt surrogate goals and only then decide whether to make other commitments. If you, say, self-modify all at once to have a surrogate goal and to never give in, then the opponent should treat this as if you had just committed to never give in without the surrogate goal. There are other cases in which the order does not matter. For example, imagine that you choose 0 versus 1 and also choose between A, B, and C. Then you play some game whose payoffs are determined in some complex way by the A/B/C choice. Additionally, the first number (0 vs. 1) is simply added to each payoff in the resulting game. Choosing 1 is an SPI over choosing 0, and there’s no particular problem with choosing 0 versus 1 and A versus B versus C at the same time. (And there’s also no need, it would seem, to have separate “departments” make the two different choices.) So in this case it doesn’t matter whether the factorization is somehow (e.g., temporally) (self-)imposed on the decision maker. It still matters that the factorization exists, though. For instance, if choosing the first number to be 1 precludes choosing the letter A, then we can’t necessarily make a case for choosing 1 over 0 anymore.
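As a sketch of this order-independent case (in Python, with hypothetical payoff numbers for the games induced by the A/B/C choice):

```python
# Sketch of the order-independent case: the 0/1 choice simply adds that number
# to every payoff of whichever game the A/B/C choice induces. Payoff numbers
# are hypothetical.

base_games = {  # principals' payoffs per outcome for each letter choice
    "A": {"o1": (2, 0), "o2": (-1, 3)},
    "B": {"o1": (1, 1), "o2": (0, 0)},
    "C": {"o1": (3, -2), "o2": (2, 2)},
}

def game(letter, bit):
    return {o: (u + bit, v + bit) for o, (u, v) in base_games[letter].items()}

# Choosing 1 is outcome-wise Pareto-better than choosing 0 for every letter,
# so the 0/1 decision can be settled without settling the A/B/C decision.
print(all(
    all(game(l, 1)[o][i] >= game(l, 0)[o][i] for o in base_games[l] for i in (0, 1))
    for l in base_games
))  # True
# If choosing 1 removed some letter from the menu, this comparison would no
# longer settle the 0/1 choice on its own.
```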
In a recent draft, DiGiovanni et al. (2024) provide a justification of safe Pareto improvements from an expected utility maximization standpoint. (See also their less technical blog post: “Individually incentivized safe Pareto improvements in open-source bargaining”.) I think, and Anthony DiGiovanni tentatively agrees, that one of its underlying ideas can be viewed as factorizing the choice set (which in their setting is the set of computer programs). In particular, their Assumptions 5 and 9 each enable us to separate out the choice of “renegotiation function” (their mechanism for implementing SPIs) from other decisions (e.g., how hawkish to be), so that we can reach conclusions of the form, “whatever we choose for the ‘other stuff’, it is better to use a renegotiation function” (see their Proposition 2 and Theorem 3). Importantly their results don’t require the factorization to manifest in, say, the temporal order of decisions. So in that way it’s most similar to the example at the end of the preceding paragraph. They also allow the choice of the “other stuff” to depend on the presence of negotiation, which is different from all examples discussed in this post. Anyway, there’s a lot more to their paper than this factorization and I myself still have more work to do to get a better understanding of it. Therefore, a detailed discussion of this paper is beyond the scope of this post, and the discussion in this section is mostly naive to the ideas in their paper.
I like the factorization approach. I think this is how I usually think of surrogate goals working in practice. But I nonetheless have many questions about how well this justification works. Generally, the story takes many aspects of the agent’s reasoning as given, and I’d be interested in whether these aspects could be treated as variable. Here are some specific concerns.
1) In some cases, the story assumes that decisions are factorized and/or faced in some fixed specific order that is provided exogenously. Presumably, the SPI argument doesn’t work in all orders. For example, in the “two-bits story” let’s say you first have to choose {11,00} versus {10,01} and then choose within the respective set (as opposed to first choosing {10,11} versus {00,01} and then within the chosen set). Then neither of the two decisions individually can be resolved by an SPI argument. What if the agent can choose herself in what order to think about the different decisions?
There are lots of ideas to consider here. For example, maybe agents should first consider all possible ordered factorizations of their decision and then use a factorization that allows for an SPI. But this all seems non-trivial to work out and justify. For instance, in order to avoid infinite regress, many possible lines of meta-reasoning have to be suppressed. But then we need a justification for why this particular form of meta-reasoning (how can I decompose my decision problems in the best possible way?) should not be suppressed. Also, how do we arrive at the judgment that the SPI-admitting factorization is best? It seems that this to some extent presupposes some kind of pro-SPI judgment.
2) Relatedly, the argument hinges to some extent on the assumption that there is some point in time at which the opportunity to implement the SPI expires (while relevant other decisions – the ones influenced by the SPI – might still be left to be made later). For example, in the two-bit story, the SPI can only be implemented on the first day, and an important decision is made on the second day. In some real-world settings this might be realistic.* But in some cases, it might be unrealistic. That is, in some cases the decision of whether to adopt the SPI can be delayed for as long as we want.
As an example, consider the following variant of the two-bits story. First, imagine that in order for choosing the first bit to be 1 to be an SPI over choosing it to be 0, we do need to choose the bits sequentially. For instance, imagine that the first bit determines whether the agent implements a surrogate goal and the second bit determines whether the agent commits to be a more aggressive bargainer. Now imagine that the agent has 100 time steps. (Let’s assume that the setting doesn’t have any “commitment race”-type dynamics – there’s no benefit to commit to an aggressive strategy early in order to make it so that the opponent sees your commitment before they make a commitment. E.g., you might imagine that a report of your commitments is only sent to opponents after all 100 time steps have passed.) In each time step, the agent gets to think a little. It can then make some commitment about the two bits. For example, it might commit to not choose 11, or commit to choose 1 as the first bit. Intuitively, we want the agent to at some point choose the first bit, and then choose the second bit in the last step. Perhaps in this simple toy model you might specifically want the agent to choose the first bit immediately, but in most practical settings, figuring out whether a surrogate goal could be used and figuring out how to best apply it takes some time. In fact, one would think that in practice there’s almost always some further thinking that could be done to improve the surrogate goal implementation. If there’s SPI selection, then you might want to wait to decide what SPI to go for. This would suggest that the agent would push the surrogate goal implementation back to the 99th time step. (Because sequential implementation is required, you cannot push implementation back to the 100th time step – on the 100th time step, you need to choose whether to commit to an aggressive bargaining policy while already having adopted the surrogate goal.)
The problem is that if you leave the SPI decision to the final time step, this leaves very little time for making the second decision (the decision of whether to adopt an aggressive bargaining policy). (You might object: “Can’t the agent think about the two decisions in parallel?” But one big problem with this is that before adopting a surrogate goal (before setting the first bit to be 1), the agent has different goals than it has after adopting the surrogate goal. So, if the agent anticipates the successful implementation of surrogate goals, it has little interest in figuring out whether future threats would be credible. In general, the agent would want to think thoughts that make its future self less likely to give in. (No, this doesn’t violate the law of total probability.))
One perspective on the above is that even once you have (meta-)decided to first decide whether or how to implement the surrogate goal, you still face a ternary choice at each point: A) commit to the surrogate goal; B) commit against the use of the surrogate goal; and C) wait and decide between A and B later. The SPI framework tells you that A > B. But it can’t compare A and C.
Solution idea 3: Justification from uncertainty about one another’s beliefs
Here’s a third idea. Contrary to the other two approaches, this one hinges much more on the multi-agent aspect of SPIs. Roughly, this justification for playing SPIs on the default is as follows: Let’s say it’s common knowledge that the default way of playing is good (relative to a randomly selected intervention, for instance). Also, if G’ is an SPI on G, then by virtue of following from a simple set of (chosen-to-be-)uncontroversial assumptions, it is common knowledge that G’ is better than G. So, everyone agrees that G’ would be better for everyone than G, and everyone knows that everyone else agrees that G’ would be better for everyone than G, and so on. In contrast, if G’ is not an SPI on G and your belief that G’ is better than G is based on forming beliefs about how the agents play G (which might have many equilibria), then you don’t know for sure whether the other players would agree with this judgment or not. This is an obstacle to the use of “unsafe Pareto improvements” for a number of reasons, which I will sketch below. Compare the Myerson–Satterthwaite theorem, which suggests that Pareto-improving deals often fail to be made when the parties have access to private information about whether any given deal is Pareto-improving. Relative to this theorem, the following points are pre-theoretical.
- First and most straightforwardly, if you (as one of the principals) know that some non-SPI G’’ is better for both you and Alice than G, but you don’t know whether Alice (the other principal) knows that G’’ is better (for her) than G, then – in contexts where Alice has some sort of say – you will worry that Alice will oppose G’’. If there’s a restriction on the number of times that you can propose an intervention, this is an argument against proposing G’’ from an expected utility maximization perspective.
Here’s an extremely simple numeric example. Let’s say you get to propose one intervention and Alice has to accept or reject. If she rejects, then G is played. (Note that we thereby already avoid part of the justification problem – the default G is, to some extent, imposed externally in a way that it normally would not be.) Now let’s say that G’ is an SPI on G. Let’s say that you also think that some other G’’ is better in expectation for you than G, but you think there’s a 20% chance that Alice will disagree that G’’ is better than G and will therefore reject it. Then for you to propose G’’ rather than G’, it would need to be the case that 80% * (u(G’’) – u(G)) > u(G’) – u(G). (This assumes that Alice is certain to accept the SPI G’ and that you don’t take Alice’s judgment as evidence about how good G’’ is for you.) Thus, even if the SPI G’ is slightly worse for you than the non-SPI G’’, you might propose the SPI G’, because it is more likely to be accepted.
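As a quick numeric check of this condition (in Python, with hypothetical utilities):

```python
# Quick numeric check of the proposal condition, with hypothetical utilities.
# The SPI G' is assumed to be accepted for sure; the non-SPI G'' is accepted
# with probability 80%, and G is played if Alice rejects.
u_G, u_G_spi, u_G_unsafe = 0.0, 0.7, 0.8   # u(G), u(G'), u(G'')
p_accept_unsafe = 0.8

eu_propose_spi = u_G_spi
eu_propose_unsafe = p_accept_unsafe * u_G_unsafe + (1 - p_accept_unsafe) * u_G

# Propose G'' iff 80% * (u(G'') - u(G)) > u(G') - u(G).
print(eu_propose_unsafe > eu_propose_spi)  # False: here you propose the SPI G'
```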
- As an extension of the above, you might worry that the other player will worry that any proposed non-SPI was selected adversarially. That is, if you propose playing some non-SPI G’’ instead of G to Alice, then Alice might worry that you only proposed G’’ because it disproportionately favors you at her cost. That is, Alice might worry that you have private information about the agents and that this information tells you that G’’ is specifically good for you in a way that might be bad for her.

Here’s again a toy example. Let’s imagine again that G is in fact played by default (which, again, assumes part of the justification gap away). Imagine that there is exactly one SPI on G. Let’s say that additionally there are 1000 other possible games G1,…,G1000 to play that all look like they’re slightly better in expectation for both players than G. Finally, imagine that you receive private information that implies that one of the 1000 games is very good for you and very bad for the other player. Now let’s say that you’ve specifically observed that G233 is the game that is great for you and bad for your opponent. If you now propose that you and the opponent play G233, your opponent can’t tell that you’re trying to trick them. Conversely, if you propose G816, then your opponent doesn’t know that you’re not trying to trick them. With some caveats, standard game-theoretic reasoning would therefore suggest that if you propose any of G1,…,G1000, you ought to propose G233. Knowing this, the other principal will reject the proposal. Thus, you know that proposing any of G1,…,G1000 will result in playing G. If instead you propose the SPI on G, then the other principal has no room for suspicion and no reason to reject. Therefore, you should propose the SPI.
There are many complications to this story. For example, in the above scenario, the principals could determine at random which of the Gi to play. So if they have access to some trusted third party who can randomize for them, they could generate i at random and then choose to play Gi. (They could also use cryptography to “flip a coin by telephone” if they don’t have access to a trusted third party.) But then what if the games in G1,…,G1000 aren’t Pareto improvements on average, and instead the agents also have private information about which of these games are (unsafe) Pareto improvements? There are lots of questions to explore here, too many to discuss in this post.

- As a third consideration, agents might worry that in the process of settling on a non-SPI, they reveal private information, and that revealing private information might have downsides. For example, if the default G is a Game of Chicken, then indicating willingness to “settle” for some fixed amount reveals a (lower bound on the) belief in the opposing agent’s hawkishness. And it’s bad for you to reveal that you believe the opponent to be hawkish.
Again, there are lots of possible complications, of course. For instance, zero-knowledge mechanisms could be used to bargain in a way that requires less information to be exchanged.
Solution idea 4: Safe Pareto improvements as Schelling points
To end on a simpler idea: Safe Pareto improvements may be Schelling points (a.k.a. focal points) in some settings. For instance, you might imagine a setting in which the above justifications don’t quite apply. Then SPIs might stand out by virtue of standing out in slight alterations of the setting in which the above justifications do apply.
It seems relatively hard to say anything about the Schelling points justification of safe Pareto improvements, because it’s hard to justify preferences for Schelling points in general. For instance, imagine you and Bob have to both pick a number between 1 and 10 and you both get a reward if you both pick the same number. Perhaps you should pick 1 (or 10) because it’s the smallest (resp. largest) number. If you know each other to be Christians, perhaps you should pick 3. Perhaps you should pick 7, because 7 is the most common favorite number. If you just watched a documentary about Chinese culture together, perhaps you should pick 8, because 8 is a lucky number in China. And so on. I doubt that there is a principled answer as to which of these arguments matters most. Similarly I suspect that in complex settings (in which similarly many Schelling point arguments can be made), it’s unclear whether SPIs are more “focal” than non-SPIs.
Acknowledgments
Discussions with Jesse Clifton, Anthony DiGiovanni and Scott Garrabrant inspired parts of the content of this post. I also thank Tobias Baumann, Vincent Conitzer, Emery Cooper, Michael Dennis, Alexander Kastner and Vojta Kovarik for comments on this post.
* Arguing for this is beyond the scope of this post, but just to give a sense: Imagine that AliceCorp builds an AI system. AliceCorp has 1000 programmers and getting the AI to do anything generally requires a large effort by people across the company. Now I think AliceCorp can make lots of credible announcements about what AliceCorp is and isn’t doing. For example, some of the programmers might leak information to the public if AliceCorp were to lie about how it operates. However, once AliceCorp has a powerful AI system, it might lose the ability to make some forms of credible commitments. For example, it might be that AliceCorp can now perform tasks by having a single programmer work in collaboration with AliceCorp’s AI system. Since it’s much easier to find a single programmer who can be trusted not to leak (and presumably the AI system can be made not to leak information), it’s now much easier for AliceCorp to covertly implement projects that go against AliceCorp’s public announcements.