Self-improvement races

Most of my readers are probably familiar with the problem of AI safety: if humans create super-human artificial intelligence, the task of programming it so that it behaves as intended is non-trivial. There is a risk that the AI will act in unexpected ways, and given its super-human intelligence, it would then be hard to stop.

I assume fewer are familiar with the problem of AI arms races. (If you are, you may well skip this paragraph.) Imagine two opposing countries, each trying to build a super-human AI to reap the many benefits and potentially attain a decisive strategic advantage, perhaps taking immediate control of the future. (It is unclear whether this latter aspiration is realistic, but it seems plausible enough to significantly influence decision making.) This creates a strong incentive for both countries to develop AI as fast as possible, especially if each would dislike a future controlled by the other. For example, North Americans may fear a future controlled by China. In such cases, countries would want to invest most available resources into creating AI first, with less concern for whether it is safe. After all, letting the opponent win may be about as bad as creating an AI with entirely random goals. It turns out that under certain conditions, both countries invest close to no resources into AI safety and nearly all resources into AI capability research (at least, that’s the Nash equilibrium), thus producing an unintended outcome with near certainty. If countries are sufficiently rational, they might be able to cooperate to mitigate the risks of creating uncontrolled AI. This seems especially plausible given that the values of most humans are very similar to each other relative to how alien the goals of a random AI would probably be. However, given how frequently arms races have occurred in the past, a race toward human-level AI remains a serious worry.
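
To see how this race to the bottom can arise, here is a toy model (the payoffs and functional form are my own illustrative assumptions, not a serious analysis): each country chooses what fraction of its resources to spend on safety, the country that skimps more on safety finishes first, and the winner’s AI is aligned with probability equal to its safety investment. A brute-force search then shows that every Nash equilibrium has both sides spending close to nothing on safety:

```python
import itertools

ENMITY = -0.9   # assumed value of the opponent winning safely:
                # almost as bad as doom (-1); one's own safe win is +1
LEVELS = [k / 20 for k in range(21)]  # fraction of resources spent on safety

def payoff(s_me, s_opp):
    """Expected payoff: whoever invests less in safety wins the race,
    and the winner's AI is aligned with probability equal to its
    safety investment (ties are decided by a coin flip)."""
    win = s_me * 1 + (1 - s_me) * (-1)           # I win the race
    lose = s_opp * ENMITY + (1 - s_opp) * (-1)   # the opponent wins
    if s_me < s_opp:
        return win
    if s_me > s_opp:
        return lose
    return 0.5 * (win + lose)

def is_nash(s1, s2):
    """Neither side can gain by unilaterally changing its safety level."""
    return (all(payoff(s1, s2) >= payoff(d, s2) for d in LEVELS) and
            all(payoff(s2, s1) >= payoff(d, s1) for d in LEVELS))

equilibria = [pair for pair in itertools.product(LEVELS, repeat=2)
              if is_nash(*pair)]
print(equilibria)  # only symmetric equilibria with very low safety survive
```

With these numbers, no equilibrium spends more than 10% on safety: whenever the other side invests more than that, undercutting it slightly wins the race at a small safety cost, so high-safety profiles unravel.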

Handing power over to AIs holds both economic promise and a risk of misalignment. Similar problems haunt humans and human organizations. Say a charity hires a new director who has been successful in other organizations. This creates an opportunity for the charity to rise in influence, but it is also possible that the charity will change in ways that the people currently or formerly in charge wouldn’t approve of. Interestingly, the situation is similar for AIs that create other AIs or self-improve. Learning and self-improvement are the paths to success, but self-improvements carry the risk of affecting the goal-directed behavior of the system.

This risk seems plausible prima facie: it should be strictly easier to find self-improvements that “probably” work than to identify self-improvements that are guaranteed to work, since the former are a superset of the latter. So AIs that are willing to take risks while self-improving can improve faster. (ETA: Cf. page 154 of Max Tegmark’s 2017 book Life 3.0, which was published after this blog post.)

There are also formal reasons why proving self-improvements correct is difficult. Specifically, Rice’s theorem states that for any non-trivial semantic property p of programs, no algorithm can decide for all programs whether they have property p. (If you know about the undecidability of the halting problem, Rice’s theorem follows almost immediately from it.) As a special case, deciding for all programs whether they pursue some given goals is impossible. Of course, this does not mean that proving self-improvements correct is impossible. After all, an AI could limit itself to the self-improvements that it can prove correct (see this discussion between Eliezer Yudkowsky and Mark Waser). However, an AI without this limitation – e.g., one that merely tests a self-improvement empirically and implements it if it seems to work – can draw on a broader range of possible self-modifications and thus improve more quickly. (In general, testing a program also appears to be a lot easier than formally verifying it, but that’s a different story.) Another relevant result from provability logic is Löb’s theorem, which roughly implies that a logical system containing Peano arithmetic cannot prove the correctness of another system with that power.
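
The gap between testing and proving is easy to illustrate with a toy sketch (the code is entirely hypothetical, not any real AI’s procedure): an agent evaluates a candidate “self-improvement” by comparing it against a reference implementation on random inputs. The candidate passes a thousand random tests, yet it is not correct:

```python
import random

def candidate_sort(xs):
    """A 'self-improved' sort the agent wants to adopt. It looks right,
    but it silently drops duplicate elements -- a rare-input bug."""
    if len(xs) <= 1:
        return list(xs)
    pivot = xs[0]
    return (candidate_sort([x for x in xs if x < pivot]) + [pivot]
            + candidate_sort([x for x in xs if x > pivot]))

def empirical_check(trials=1000):
    """Accept the rewrite if it matches sorted() on random float inputs.
    Random floats essentially never collide, so the bug stays invisible."""
    random.seed(0)
    for _ in range(trials):
        xs = [random.random() for _ in range(random.randrange(20))]
        if candidate_sort(xs) != sorted(xs):
            return False
    return True

print(empirical_check())                                # all 1000 tests pass
print(candidate_sort([3, 1, 3]) == sorted([3, 1, 3]))   # False: the bug remains
```

An AI that accepts such empirically-vetted rewrites moves faster than one that demands proofs, but it occasionally adopts a subtly broken version of itself.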

Lastly, consider Stephen Wolfram’s fuzzier concept of computational irreducibility. It states that as soon as a system can produce arbitrarily complex behavior (i.e., as soon as it is universal in some sense), predicting most aspects of its behavior becomes fundamentally hard. Specifically, Wolfram argues that for most such systems – especially complex and universal ones – there is no way to find out how they behave other than to run them.
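
Wolfram’s standard example is the Rule 30 cellular automaton: a three-cell update rule whose center column has, as far as anyone knows, no predictive shortcut – you have to simulate it step by step. A minimal sketch:

```python
def rule30_step(cells):
    """Apply one step of the Rule 30 cellular automaton:
    new cell = left XOR (center OR right)."""
    n = len(cells)
    return [cells[(i - 1) % n] ^ (cells[i] | cells[(i + 1) % n])
            for i in range(n)]

width = 101                # wide enough that wraparound never matters here
cells = [0] * width
cells[width // 2] = 1      # start from a single "on" cell

center_column = []
for _ in range(20):
    center_column.append(cells[width // 2])
    cells = rule30_step(cells)

print(center_column)  # no closed-form shortcut is known; you just run it
```

Despite the trivial update rule, the center column looks random, which is the sense in which even a fully specified, deterministic system can resist prediction.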

So, self-improvement can give AIs advantages and ultimately the upper hand in a conflict, but if done too hastily, it can also lead to goal drift. Now, consider the situation in which multiple AIs compete in a head-to-head race. Based on the above considerations, this case is very similar to the AI arms races between groups of humans: every single AI has an incentive to take risks to increase its probability of winning, but overall this can lead to unintended outcomes with near certainty. There are reasons to think this self-improvement race dynamic will be more of a problem for AIs than it is for human factions. The goals of different AIs could diverge much more strongly than the goals of different humans. Whereas a human faction may prefer the enemy’s win over a takeover by an uncontrolled AI, an AI with human values confronting an AI with strange values has less to lose from risky self-modifications. (There are some counter-considerations as well. For instance, AIs may be better at communicating and negotiating compromises.)

Thus, a self-improvement race between AIs seems to share the bad aspects of AI arms races between countries. This has a few implications:

  • Finding a way (if there is one) for AIs to cooperate and prevent self-improvement races and other uncooperative outcomes becomes more important.
  • One common argument for creating AI and colonizing space is that Earth-originating aligned AI could prevent other, less compassionate AIs (uncontrolled ones, or ones created by uncompassionate ETs) from colonizing space. So, according to this argument, even if you don’t value what humans or human-controlled AIs would do in space, you should still choose it as the lesser of two (or more) evils. However, the problem of self-improvement races calls this argument into question.
  • On a similar note, making the universe more crowded with AIs – especially ones with weird (evolutionarily uncommon) values or ones that are unable to cooperate – may be harmful, as it could lead to results that are bad for everyone (except for the AI that is created in a self-modification gone wrong).

Acknowledgment: This work was funded by the Foundational Research Institute (now the Center on Long-Term Risk).

18 thoughts on “Self-improvement races”

    1. Thanks for your comment! Probably it’s not the time for me to suggest policies, yet. Some research areas that may help assess the problem of self-improvement races in particular are:
      – The game and decision theory of cooperation: Can two agents cooperate without directly seeing what the other one is doing? (An AI could probably conceal fairly well which self-improvement strategy it is using.) Tit-for-tat no longer works in this case.
      – Thinking more about AI self-improvement and assessing the trade-off between reliable goal preservation and fast self-improvement. This does not seem particularly tractable except for relatively simple AI systems.
      – Assessing the probability that an AI will face other AIs originating from other planets, i.e., the probability that there is life on other planets in the universe, that it will colonize space (passing great filters, if there are any), and that “we”/our AI will meet them/their AI before the heat death of the universe.


      1. Mark Waser

        >> Caspar said “Probably it’s not the time for me to suggest policies, yet.”

        If only government were so wise as to not suggest policies before they understand . . . .

        May I point you at for some suggestions . . . .


      2. > An AI could probably conceal fairly well which self-improvement strategy it is using.

        That’s not obviously true, since the “inspectors” would be other AIs at comparable levels of speed and intelligence. Humans can sometimes do an ok job at nuclear inspections and spying on foreign governments (although there have still been plenty of arms races between powerful nations…).


      3. >That’s not obviously true, since the “inspectors” would be other AIs at comparable levels of speed and intelligence. Humans can sometimes do an ok job at nuclear inspections and spying on foreign governments (although there have still been plenty of arms races between powerful nations…).

        Yeah, good point. OTOH, AIs have fewer evolutionarily built-in signaling mechanisms. Also, colonizing AIs are huge, so they would only be able to inspect parts of each other.


        > OTOH, AIs have fewer evolutionarily built-in signaling mechanisms.

        Good point. But they may come up with inventive artificial signaling mechanisms that we’re too dumb to have invented yet. 🙂

        > Also, colonizing AIs are huge, so they would only be able to inspect parts of each other.

        I envision these races taking place on Earth, since I would guess a single faction will have won before space colonization starts in earnest. But maybe not. And I agree regarding alien encounters.


  1. > It is unclear whether this hope is realistic, but it seems plausible enough to significantly influence decision making.

    Arms races could still happen in a slow-takeoff scenario, for the same reason as countries compete to be leaders on any profitable and powerful technology.

    > different AIs could lie much farther apart in mind space and thus differ much more strongly in values.

    Yeah, unless all the AIs that humans build have inherited the values of their creators quite closely.

    There might be other differences in outcomes between human arms races vs. AI arms races, such as because humans have more hard-wired social emotions. But it’s not obvious how those differences would affect the outcomes.

    > However, self-improvement races suggest that this may not work in a systematic way.

    I don’t understand what you mean here. 🙂 I can imagine two possibilities:

    1. Assuming a multipolar outcome on Earth, it’s unlikely AI control will succeed (due to arms-race dynamics), so it’s unlikely Earth-originating AI will be more humane than ET AI.

    2. Even given a singleton AI on Earth, it would want to race against ET AIs and so wouldn’t be as careful with its own self-improvements.

    I’m skeptical about #2, since the time scales for the development of life appear to be measured in billions of years, so spending an extra few centuries or whatever to do a better job at goal-preserving self-improvement seems like a small cost to pay in terms of probability of winning the light cone, but it has huge benefits. Of course, if goal preservation is really hard and would slow down an AI’s progress by not just a few centuries but by like a permanent 10% or something, or would make the end-stage AI less powerful than it could have been had it used hacks, then there would be a more substantial tradeoff between safety vs. probability of winning a battle against aliens.


    1. >Arms races could still happen in a slow-takeoff scenario, for the same reason as countries compete to be leaders on any profitable and powerful technology.

      Yeah, good point! 🙂 I changed the text a little in response.

      I agree with your points on the second passage you quoted.

      Regarding the last point you commented on: In this passage specifically, I am mainly talking about #2, although #1 could be a significant worry as well, especially in soft takeoff scenarios. (In a hard takeoff, the first AI would immediately take control of the Internet or use other ways of preventing the creation of another AI.)

      Yes, the argument does not work if making your self-improvement reliable is something you only have to focus on in the first couple of centuries and which then does not bother you anymore. I think there is a good chance, though, that this is not the case and that a limitation to proven or extremely-likely-to-be-safe self-improvements slows down progress basically forever. On the given time scales, this could make a big difference even if the slowdown from proving everything is small. For example, if n > 1 and ε > 0, then (n + ε)^t eventually becomes much larger than n^t, and even t^(n+ε) eventually becomes much larger than t^n, where t represents time and ε the (small) slowdown.
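
      To put toy numbers on this (the growth rates are made up purely for illustration): even a one-percentage-point difference in per-period growth compounds into an enormous capability gap.

```python
# Compare capability growth n^t (careful AI) vs. (n + eps)^t (risk-taking
# AI). The growth rates below are illustrative assumptions only.
n, eps = 1.10, 0.01   # 10% vs. 11% growth per period
for t in (100, 1000, 5000):
    print(t, ((n + eps) / n) ** t)  # ratio of risk-taker's to prover's capability
```

      The ratio is only about 2.5 after 100 periods, but it keeps compounding without bound, which is why a permanent small slowdown matters on cosmic time scales.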


      1. Makes sense. I agree that an extreme degree of certainty about self-improvements would probably handicap an AI permanently. (Indeed, I suspect that a Gödel machine would run too slowly to take over the world at all, though this judgement is just based on priors about computational intractability rather than expertise on my part about the Gödel machine.) It’s less clear for modest degrees of control over future agents. Humans are ok at goal preservation over decades, but over millennia, less so. (Imagine what a medieval pope would think of Pope Francis.) Still, the medieval Catholic Church does have a nontrivial effect on present-day society due to historical baggage.

        The question also depends on how long it takes to become a mature AGI civilization and whether one’s power levels off or not. If there’s only so much an AGI can do to win wars against aliens, and it reaches that stage within a mere few million years, then being slower might not be fatal.

        Anyway, I agree there’s a non-small chance that you’re right and that maybe most AGIs will be forced to abandon a high degree of goal preservation. Interestingly, this somewhat reduces our expected influence on the far future because the less goals will be preserved, the less what we do now matters (except insofar as we affect whether space gets colonized at all). It should also increase the degree of pessimism about the far future that effective altruists hold, since it suggests that long-term preservation of human(e) values may be harder than we thought.


      2. I guess one final implication is that the scenario in which Earth is not just somewhat rare but actually extremely rare may get more prudential weight than before, because in that scenario, an AI would face less pressure to win a self-improvement race with aliens, so that’s the main scenario where our influence on AI outcomes is least likely to be washed away over time.

        This consideration is in opposition to the argument that we should focus more on scenarios where Earth-like planets are not too rare because if so, there are more copies of us in the universe.


