Most of my readers are probably familiar with the problem of AI safety: if humans create super-human artificial intelligence, the task of programming it in such a way that it behaves as intended is non-trivial. There is a risk that the AI will behave in unexpected ways, and given its super-human intelligence, it would then be hard to stop.
I assume fewer are familiar with the problem of AI arms races. (If you are, you may well skip this paragraph.) Imagine two opposing countries, each trying to build a super-human AI to reap its many benefits, gain a decisive strategic advantage, and perhaps even take control of the future outright, with the other country left behind or even destroyed. (It is unclear whether this latter hope is realistic, but it seems plausible enough to significantly influence decision making.) This creates a strong motivation for the two countries to create AI as fast as possible, especially if each would dislike a future controlled by the other. For example, North Americans may dislike a future controlled by the Chinese, because individualism is valued less in China. In such cases, countries would want to invest most available resources into creating AI first, with less concern for whether it is safe. After all, letting the opponent win may be about as bad as ending up with an AI that has completely random goals. It turns out that under certain conditions, both countries would invest essentially no resources into AI safety and all resources into AI capability research (at least, that is the Nash equilibrium), thus leading to an unintended outcome with near certainty. If countries are sufficiently rational, they might be able to cooperate and avoid taking excessive risks of creating uncontrolled AI, given that the values of most humans are actually very similar relative to how alien the values of some random AI would be. But given how common arms races are in history, a race toward AI remains a serious worry.
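To make the equilibrium claim concrete, here is a minimal sketch of a toy race model in Python. The payoff structure and parameters are my own illustrative assumptions (not taken from any particular paper): each country splits one unit of effort between safety and capability, the side with more capability builds AI first, and the winner's AI turns out aligned only with probability equal to the winner's safety investment.

```python
import itertools

# Toy AI race model (illustrative assumptions, not from the post):
# each country picks a safety fraction s; capability is 1 - s.
# The side with more capability wins the race. The winner's AI is
# aligned with probability s (the winner's safety investment);
# a misaligned AI is worth 0 to everyone.
LEVELS = [i / 10 for i in range(11)]  # safety fractions 0.0 .. 1.0
ENMITY = 0.2  # value a country places on the *other* side winning safely

def payoff(s_me, s_other):
    """Expected payoff to 'me' given both sides' safety choices."""
    if s_me < s_other:                 # I have more capability: I win
        return s_me                    # worth 1 if my AI is aligned
    if s_me > s_other:                 # the opponent wins
        return s_other * ENMITY
    return 0.5 * (s_me + s_other * ENMITY)  # tie: coin flip

def is_nash(s1, s2):
    """Neither side can gain by unilaterally changing its safety level."""
    return (payoff(s1, s2) >= max(payoff(d, s2) for d in LEVELS) and
            payoff(s2, s1) >= max(payoff(d, s1) for d in LEVELS))

print([p for p in itertools.product(LEVELS, repeat=2) if is_nash(*p)])
# -> [(0.0, 0.0), (0.1, 0.1), (0.2, 0.2)]: only low-safety equilibria
```

Each side is tempted to undercut the other's safety level in order to win the race, so only equilibria with very little safety investment survive; with a finer effort grid, the equilibrium safety investment shrinks toward zero.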
So it is risky for humans to create AIs and hand power over to them. There is a chance that the AI's values will be aligned with those of its creators, which holds great benefits, but there is also a chance that the creators lose control. Similar problems actually haunt humans and human organizations. Say a charity brings in a new director who has been very successful in other organizations. This creates an opportunity for the charity to rise in influence, but it is also possible that the charity changes in ways that the people in charge of its old version would not approve of. Interestingly, the situation is similar for AIs that create other AIs or self-improve. As for humans, learning and self-improvement is the path to success for AIs. However, for an AI there is a chance that the values of the new (version of the) AI are no longer aligned with the values of the original AI.
For one, this is true prima facie: it should be strictly easier to find self-improvements that "probably" work than to find self-improvements that are guaranteed to work, since the former are a superset of the latter. So AIs that are willing to take risks while self-improving can improve faster.
There are also formal results that justify why proving self-improvements correct is difficult. Specifically, Rice's theorem states that for any non-trivial semantic property p, there is no general way of deciding for all programs whether they have property p. (If you know about the undecidability of the halting problem, Rice's theorem basically follows from it.) As a special case, deciding for all programs whether they pursue some given goal is impossible. Of course, this does not mean that proving self-improvements correct is impossible: an AI could limit itself to the self-improvements that it can prove correct (see this discussion between Eliezer Yudkowsky and Mark Waser). However, without this limitation (i.e. if it can test a self-improvement for a while and implement it if it seems to work), an AI can draw on a wider range of possible self-modifications and thus improve more quickly. In general, testing a program also seems to be a lot easier than formally verifying it, but that is a different story. Another relevant result from (provability) logic is Löb's theorem, which roughly implies that a logical system containing Peano arithmetic cannot prove a system of that same power (including itself) to be sound.
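The gap between testing and proving can be made concrete with a small runnable illustration (the policy and the numbers are made up for the example): a candidate self-modification can pass any finite behavioral test and still violate the intended property later. By Rice's theorem, no general procedure can close this gap for arbitrary programs.

```python
# Illustrative only: a "self-improvement" that passes any finite test
# yet does not have the property "pursues the original goal forever".

def candidate_policy(step: int) -> str:
    """Behaves as intended for a long time, then switches goals."""
    return "original goal" if step < 10**9 else "other goal"

def passes_test(policy, budget: int) -> bool:
    """The kind of empirical check an AI might run before adopting
    a self-modification: watch its behavior for `budget` steps."""
    return all(policy(t) == "original goal" for t in range(budget))

print(passes_test(candidate_policy, budget=10**5))  # True: the test passes
# ...yet the policy is not aligned at every step. Only a proof covering
# all steps rules this out, and no general algorithm can supply such
# proofs for arbitrary programs.
```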
Lastly, consider Stephen Wolfram's fuzzier concept of computational irreducibility, which basically states that as soon as a system can produce arbitrarily complex behavior (i.e. as soon as it is universal in some sense), which Wolfram believes "most" systems to be, predicting most aspects of the system's behavior becomes fundamentally difficult. Specifically, he argues that for most (especially complex and universal) systems, there is no way to find out how they behave other than by running them.
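Wolfram's standard example is the Rule 30 cellular automaton, and a few lines of Python are enough to sketch the idea: no known closed-form shortcut predicts, say, the automaton's center column, so in practice the only way to learn what it does is to run it.

```python
# Rule 30: each cell's next state is a fixed function of itself and
# its two neighbors; bit i of the number 30 encodes the outcome for
# the neighborhood pattern with binary value i.
RULE = 30

def step(cells):
    n = len(cells)
    return [(RULE >> (4 * cells[(i - 1) % n] + 2 * cells[i] + cells[(i + 1) % n])) & 1
            for i in range(n)]

cells = [0] * 31
cells[15] = 1                      # start from a single black cell
for _ in range(16):
    print("".join("#" if c else "." for c in cells))
    cells = step(cells)            # no shortcut: we just run the rule
```

Despite the trivial rule, the output quickly becomes disordered enough that Rule 30 has even been used as a pseudo-random number generator.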
So self-improvement can give AIs advantages and ultimately the upper hand in a conflict, but if done too hastily, it can also lead to a loss of control. Now consider a situation in which multiple AIs compete in a head-to-head race. Based on the above considerations, this situation becomes very similar to an AI arms race between groups of humans. Each AI has an incentive to take risks to increase its probability of winning, but overall this can lead to unintended outcomes with near certainty. There are reasons to assume that this self-improvement race dynamic will be more of a problem for AIs than it is for human factions, because different AIs could lie much farther apart in mind space and thus differ much more strongly in values. Whereas human factions may greatly prefer the enemy's victory over a takeover by an uncontrolled AI, an AI with human values confronting an AI with strange values (e.g. one maximizing paperclips) has less to lose from risky self-modifications.
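A back-of-the-envelope calculation makes this asymmetry explicit (the numbers are toy values of my own choosing): the less an agent values the outcome in which its rival wins, the more value-corruption risk it will accept in exchange for a higher chance of winning.

```python
# Toy expected-utility comparison (illustrative numbers, not from the post).
# Winning with intact values is worth 1, winning with corrupted values 0,
# and losing is worth v_opp: how much the agent values its rival's victory.

def expected_value(p_win, p_values_survive, v_opp):
    return p_win * p_values_survive + (1 - p_win) * v_opp

# Risky self-modification: win probability 0.5 -> 0.8, but a 30% chance
# that the agent's values do not survive the modification.
for rival, v_opp in [("human-like rival", 0.9), ("paperclip maximizer", 0.0)]:
    safe = expected_value(p_win=0.5, p_values_survive=1.0, v_opp=v_opp)
    risky = expected_value(p_win=0.8, p_values_survive=0.7, v_opp=v_opp)
    print(f"{rival}: safe={safe:.2f}, risky={risky:.2f}")
# human-like rival:    safe=0.95, risky=0.74  -> caution pays
# paperclip maximizer: safe=0.50, risky=0.56  -> the gamble becomes attractive
```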
So, a self-improvement race between AIs seems to share the bad aspects of AI arms races between countries. This has a number of implications:
- Finding out whether (and how) AIs can cooperate to prevent self-improvement races and other uncooperative outcomes becomes more important.
- One common argument for creating AI and colonizing space is that Earth-originating Friendly AI could prevent other, less compassionate AIs (uncontrolled, or created by uncompassionate extraterrestrials) from colonizing space. So, according to the argument, even if you dislike what humans or human-controlled AIs would do in space, you should still choose it as the lesser of two (or more) evils. However, self-improvement races suggest that this may not work reliably: an initially Friendly AI that has to race against rivals may itself have to gamble on risky self-modifications.
- On a similar note, making the universe more crowded with AIs, especially ones with weird (evolutionarily uncommon) values or ones that are unable to cooperate, may itself be very uncooperative. It may lead to results that are bad for everyone (except perhaps for the AI that emerges from a self-modification gone wrong).