Suicide as wireheading

In my last post you learned about wireheading. In this post, I’ll draw attention to a specific example of wireheading: suicide or, as one would call it for robots, self-destruction. What differentiates self-destruction from more commonly discussed forms of wireheading is that it does not lead to a pleasurable or very positive internal state. But like them, it is a measure that does not solve problems in the external world so much as it changes one’s state of mind (into nothingness).

Let’s consider a reinforcement learner with an internal module that generates rewards between -1 and 1. Now, there will probably be situations in which the reinforcement learner has to expect mostly negative rewards in the future. Assuming that zero utility is assigned to nonexistence (similar to how humans think of the states prior to their existence or phases of sleep as neutral to their hedonic well-being), a reinforcement learner may well want to end its existence to increase its utility. It should be noted that typical reinforcement learners as studied in the AI literature don’t have a concept of death. However, a reinforcement learner that is meant to work in the real world would have to think about its death in some way (which may be completely confused by default). An example view is one in which the “reinforcement learner” actually has a utility function that calculates utility from the sum of the outputs of the physical reward module. In this case, suicide would reduce the number of outputs of the reward module, which increases expected utility whenever the remaining rewards are expected to be negative on balance. So, we have another example of potentially rational wireheading.
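To make the comparison concrete, here is a minimal sketch in Python (my own illustration; the per-step reward of -0.3 and the horizon of 100 steps are made-up numbers) of the calculation such a “sum of reward-module outputs, zero for nonexistence” agent would perform:

```python
# Toy comparison for an agent whose utility is the sum of its reward module's
# outputs, with nonexistence contributing nothing to that sum.

def expected_utility_continue(expected_reward_per_step, remaining_steps):
    """Expected utility of continuing: sum of expected future rewards."""
    return expected_reward_per_step * remaining_steps

def expected_utility_self_destruct():
    """Self-destruction: the reward module produces no further outputs."""
    return 0.0

if __name__ == "__main__":
    # If rewards are expected to be negative on average (here -0.3 per step
    # over 100 remaining steps), ending existence looks strictly better
    # under this utility function.
    u_continue = expected_utility_continue(-0.3, 100)  # -30.0
    u_stop = expected_utility_self_destruct()          #   0.0
    print(u_continue, u_stop, u_stop > u_continue)     # -30.0 0.0 True
```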

However, for many agents self-destruction is irrational. For example, if my goal is to reduce suffering then I may feel bad once I learn about instances of extreme suffering and about how much suffering there is in the world. Killing myself ends my bad feeling, but prevents me from achieving my goal of reducing suffering. Therefore, it’s irrational given my goals.

The actual motivations of real-world people committing suicide seem a lot more complicated most of the time. Many instances of suicide do seem to be about bad mental states. But many people also attempt to use their death to achieve goals in the outside world, the most prominent example being seppuku (or harakiri), in which suicide is performed to maximize one’s reputation or honor.

As a final note, looking at suicide through the lens of wireheading provides one way of explaining why so many beings who live very bad lives don’t commit suicide. If an animal’s goals are things like survival, health, mating etc. that correlate with reproductive success, the animal can’t achieve its goals by suicide. Even if the animal expects its life to continue in an extremely bad and unsuccessful way with near certainty, it behaves rationally if it continues to try to survive and reproduce rather than alleviate its own suffering. Avoiding pain is only one of the animal’s goals, and if intelligent, rational agents are to be prevented from committing suicide in a Darwinian environment, reducing their own pain had better not be the dominating consideration. In sum, the fact that an agent does not commit suicide tells you little about the state of its well-being if it has goals about the outside world, which we should expect to be the case for most evolved beings.

Wireheading

Some of my readers may have heard of the concept of wireheading:

Wireheading is the artificial stimulation of the brain to experience pleasure, usually through the direct stimulation of an individual’s brain’s reward or pleasure center with electrical current. It can also be used in a more expanded sense, to refer to any kind of method that produces a form of counterfeit utility by directly maximizing a good feeling, but that fails to realize what we value.

From my experience, people are confused about what exactly wireheading is and whether it is rational to pursue or not, so before I discuss some potentially new thoughts on wireheading in the next post, I’ll elaborate on that definition a bit and give a few examples.

Let’s say your only goal is to be adored by as many people as possible for being a superhero. Then thinking that you are such a superhero would probably be the thing that makes you happy. You would thus probably be happy while playing a superhero video game that is so immersive that, while playing, you actually believe you are a superhero and forget grim reality for a while. So, if you just wanted to be happy or feel like a superhero, you would play this video game a lot, given that it is so difficult to become a superhero in real life. But this isn’t what you want! You don’t want to believe that you’re a superhero. You want to be a superhero. Playing the video game does not help you attain that goal; instead, push-ups and spinach (or, perhaps, learning about philosophy, game theory and theoretical computer science) help you to be a superhero.

So, if you want to become a superhero, fooling yourself into believing that you are a superhero obviously does not help you. It even distracts you. In this example, playing the video game was an example of wireheading (in the general sense) that didn’t even require you to open your skull. You just had to stimulate your sensors with the video game and not resist the immersive experience. The goal of being a superhero is an example of a goal that refers to the outside world. It is a goal that cannot be achieved by changing your state of mind or your beliefs or the amount of dopamine in your brain.

So, the first thing you need to know about wireheading is that if your goals are about the outside world, you need to be irrational or extremely confused or in a very weird position (where you are paid to wirehead, for example) to do it. Let me repeat (leaving out the caveats): If your utility function assigns values to states of the world, you don’t wirehead!

What may be confusing about wireheading is that for some subset of goals (or utility functions), wireheading actually is a rational strategy. Let’s say your goal is to feel (and not necessarily be) important like a superhero. Or to not feel bad about the suffering of others (like the millions of fish which seem to die a painful death from suffocation right now). Or maybe your goal is actually to maximize the amount of dopamine in your brain. For such agents, manipulating their brain directly and instilling false beliefs in themselves can be a rational strategy! It may look crazy from the outside, but according to their (potentially weird) utility functions, they are winning.

There is a special case of agents whose goals refer to their own internals, which is often studied in AI: reinforcement learners. These agents basically have some reward signal which they aim to maximize as their one and only goal. The reward signal may come from a module in their code which has access to the sensors. Of course, AI programmers usually don’t care about the size of the AI’s internal reward numbers but instead use the reward module of the AI as a proxy for some goals the designer wants to be achieved (world peace, the increased happiness of the AI’s users, increased revenue for HighDepthIntellect Inc. …). However, the reinforcement learning AI does not care about these external goals – it does not even necessarily know about them, although that wouldn’t make a difference. Given that the reinforcement learner’s goal is about its internal state, it would try to manipulate its internal state towards higher rewards if it gets the chance no matter whether this correlates with what the designers originally wanted. One way to do this would be to reprogram its reward module, but assuming that the reward module is not infallible, a reward-based agent could also feed its sensors with information that leads to high rewards even without achieving the goals that the AI was built for. Again, this is completely rational behavior. It achieves the goal of increasing rewards.
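As a minimal illustration (entirely made up for this post; the action names and reward numbers are hypothetical), note that a pure reward maximizer’s decision rule never consults the designer’s goal at all:

```python
# Hypothetical toy setup: the designer cares about an external outcome,
# but the agent's decision rule only looks at its reward module's output.

# Rewards the internal reward module would emit for each action
# ("tamper_with_sensors" stands for feeding the reward module inputs that
# score highly without achieving the external goal).
reward_module_output = {
    "work_on_designers_goal": 0.6,
    "tamper_with_sensors": 1.0,
}

# What the designer actually wanted (never consulted by the agent).
external_goal_achieved = {
    "work_on_designers_goal": True,
    "tamper_with_sensors": False,
}

def choose_action(rewards):
    """A reward maximizer picks whatever scores highest internally."""
    return max(rewards, key=rewards.get)

action = choose_action(reward_module_output)
print(action)                          # -> tamper_with_sensors
print(external_goal_achieved[action])  # -> False: wireheading, yet rational for this agent
```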

So, one reason for confusion about wireheading is that there actually are goal systems under which wireheading is a rational strategy. Whether wireheading is rational depends mainly on your goals, and given that goals are different from facts, the question of whether wireheading is good or bad is not purely a question of facts.

What makes this extra-confusing is that the goals of humans are a mix between preferences regarding their own mental states and preferences about the external world. For example, I have both a preference for not being in pain and a preference against most things that cause pain. People enjoy fun activities (sex, taking drugs, listening to music etc.) for how it feels to be involved in them, but they also have a preference for a more just world with less suffering. The question “Do you really want to know?” is asked frequently, and it’s often unclear what the answer is. If all of your goals were about the outside world and not your state of mind, you would (usually) answer such questions affirmatively – knowledge can’t hurt you, especially because, in expectation, a piece of evidence can’t “make things worse” than you already expected things to be before receiving it. Sometimes, people are even confused about why exactly they engage in certain activities, and specifically about whether it is about fulfilling some preference in the outside world or changing one’s state of mind. For example, most who donate to charity think that they do it to help kids in Africa, but many also want the warm feelings from having made such a donation. And often, both are relevant. For example, I want to prevent suffering, but I also have a preference for not thinking about specific instances of suffering in a non-abstract way. (This is partly instrumental, though: learning about a particularly horrific example of suffering often makes me a lot less productive for hours. Gosh, so many preferences…)

There is another thing which can make this even more confusing. Depending on my ethical system I may value people’s actual preference fulfillment or the quality of their subjective states (the former is called preference utilitarianism and the latter hedonistic utilitarianism). Of course, you can also value completely different things like the existence of art, but I think it’s fair to say that most (altruistic) humans value at least one of the two to a large extent. For a detailed discussion of the two, consider this essay by Brian Tomasik, but let’s take a look at an example to see how they differ and what the main arguments are. Let’s say your friend Mary writes a diary, which contains information that is of value to you (be it for entertainment or something else). However, Mary, like many who write a diary, does not want others to read its content. She’s also embarrassed about the particular piece of information that you are interested in. Some day you get the chance to read her diary without her knowing. (We assume that you know with certainty that Mary is not going to learn about your betrayal and that overall the action has no consequences other than fulfilling your own preferences.) Now, is it morally reprehensible for you to do so? A preference utilitarian would argue that it is, because you decrease Mary’s utility: her goal of not having anyone know the content of her diary is not achieved. A hedonistic utilitarian would argue that her mental state is not changed by your action and so she is not harmed. The quality of her life is not affected by your decision.

This divide in moral views directly applies to another question about wireheading: Should you assist others in wireheading or even actively wirehead other agents? If you are a hedonistic utilitarian, you should; if you are a preference utilitarian, you shouldn’t (unless the subject’s preferences are mainly about her own state of mind). So, again, whether wireheading is a good or a bad thing to do is determined by your values and not (only) by facts.

Self-improvement races

Most of my readers are probably familiar with the problem of AI safety: if humans create super-human-level artificial intelligence, the task of programming it in such a way that it behaves as intended is non-trivial. There is a risk that the AI will act in unexpected ways and, given its super-human intelligence, it would then be hard to stop.

I assume fewer are familiar with the problem of AI arms races. (If you are, you may well skip this paragraph.) Imagine two opposing countries which are trying to build a super-human AI to reap the many benefits and potentially attain a decisive strategic advantage, perhaps taking control of the future immediately. (It is unclear whether this latter aspiration is realistic, but it seems plausible enough to significantly influence decision making.) This creates a strong motivation for the two countries to develop AI as fast as possible. This is especially the case if each country would dislike a future controlled by the other. For example, North Americans may fear a future controlled by China. In such cases, countries would want to invest most available resources into creating AI first, with less concern for whether it is safe. After all, letting the opponent win may be about as bad as having an AI with entirely random goals. It turns out that under certain conditions both countries would invest close to no resources into AI safety and all resources into AI capability research (at least, that’s the Nash equilibrium), thus leading to an unintended outcome with near certainty. If countries are sufficiently rational, they might be able to cooperate to mitigate risks of creating uncontrolled AI. This seems especially plausible given that the values of most humans are actually very similar to each other relative to how alien the goals of a random AI would probably be. However, given that arms races have frequently occurred in the past, a race toward human-level AI remains a serious worry.
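To illustrate the kind of result such models deliver, here is a deliberately crude toy model in Python (my own construction, not from the original post; all parameter choices are made up): each country picks a safety level, the less safe country wins the race, and the winner’s AI stays controlled with probability equal to its safety level. A grid search over the discretized game then lists the Nash equilibria.

```python
import itertools

# Toy AI-race model (illustrative assumptions): each country chooses a safety
# level s in [0, 1]; capability is 1 - s, so the less safe country wins the
# race. The winner's AI stays controlled with probability s_winner. A
# controlled AI gives the winner payoff 1 and the loser ENMITY (how acceptable
# the rival's victory is); an uncontrolled AI gives everyone 0. Ties split the
# expected payoff.

ENMITY = 0.2                               # the rival's win is only slightly better than disaster
LEVELS = [i / 100 for i in range(101)]     # discretized safety choices

def payoff(s_me, s_other):
    if s_me < s_other:                     # I am less safe, so I win the race
        return s_me * 1.0
    if s_me > s_other:                     # the other country wins
        return s_other * ENMITY
    return 0.5 * (s_me * 1.0 + s_other * ENMITY)   # tie

def is_nash(s1, s2):
    best1 = max(payoff(s, s2) for s in LEVELS)
    best2 = max(payoff(s, s1) for s in LEVELS)
    return payoff(s1, s2) >= best1 and payoff(s2, s1) >= best2

equilibria = [pair for pair in itertools.product(LEVELS, LEVELS) if is_nash(*pair)]
print(equilibria)   # every equilibrium has both sides investing (close to) nothing in safety
```

With these numbers, the only equilibria have both sides at safety levels of at most 0.02; raising ENMITY (making the rival’s victory more acceptable) makes higher safety levels sustainable, which matches the intuition that races are most dangerous between parties whose values diverge sharply.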

Handing power over to AIs holds both economic promise and a risk of misalignment. Similar problems actually haunt humans and human organizations. Say a charity hires a new director who has been successful in other organizations. This creates the opportunity for the charity to rise in influence. However, it is also possible that the charity changes in a way that the people currently or formerly in charge wouldn’t approve of. Interestingly, the situation is similar for AIs which create other AIs or self-improve. Learning and self-improvement are the paths to success. However, self-improvements carry the risk of affecting the goal-directed behavior of the system.

Prima facie, this risk seems real: it should be strictly easier to find self-improvements that “probably” work than to identify self-improvements that are guaranteed to work, since the former are a superset of the latter. So, AIs which are willing to take risks while self-improving can improve faster. (ETA: Cf. page 154 of Max Tegmark’s 2017 book Life 3.0, which was published after this blog post was originally published.)

There are also formal justifications for the difficulty of proving self-improvements to be correct. Specifically, Rice’s theorem states that for any non-trivial property p of a program’s input-output behavior, there is no general way of deciding whether an arbitrary program has property p. (If you know about the undecidability of the halting problem, Rice’s theorem follows almost immediately from it.) As a special case, deciding for all programs whether they pursue some given goals is impossible. Of course, this does not mean that proving self-improvements to be correct is impossible. After all, an AI could just limit itself to the self-improvements that it can prove correct (see this discussion between Eliezer Yudkowsky and Mark Waser). However, without this limitation – e.g., if it can merely test some self-improvement empirically and implement it if it seems to work – an AI can use a broader range of possible self-modifications and thus improve more quickly. (In general, testing a program also appears to be a lot easier than formally verifying it, but that’s a different story.) Another relevant result from (provability) logic may be Löb’s theorem, which roughly implies that a logical system containing Peano arithmetic cannot prove the soundness of another system of that same power, including itself.
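To give a flavor of why such verification is undecidable, here is a sketch in Python of the standard reduction from the halting problem (my own illustration; the function names are hypothetical): if we had a decider for any non-trivial behavioral property, say “this program returns 0 on every input”, we could use it to decide whether an arbitrary program halts on a given input.

```python
# Sketch of the reduction behind Rice's theorem (illustrative only).

def build_wrapper(program, program_input):
    """Return a program whose behavior encodes whether `program` halts on
    `program_input`: it first runs program(program_input), discards the
    result, and then returns 0 regardless of its own input."""
    def wrapper(x):
        program(program_input)   # loops forever iff `program` doesn't halt on `program_input`
        return 0
    return wrapper

def decide_halting(program, program_input, decides_always_returns_zero):
    """If we were handed a decider for the property "returns 0 on every input"
    (a decider which, by Rice's theorem, cannot exist), this would decide the
    halting problem: the wrapper returns 0 on every input exactly when
    `program` halts on `program_input`."""
    return decides_always_returns_zero(build_wrapper(program, program_input))

# Since the halting problem is undecidable, no such decider exists -- and the
# same construction works for any non-trivial behavioral property, e.g.
# "this program (still) pursues goal G".
```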

Lastly, consider Stephen Wolfram’s fuzzier concept of computational irreducibility. It basically states that as soon as a system can produce arbitrarily complex behavior (i.e., as soon as it is universal in some sense), predicting most aspects of its behavior becomes fundamentally hard. Specifically, he argues that for most (especially for complex and universal) systems, there is no way to find out how they behave other than running them.
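Wolfram’s standard example is an elementary cellular automaton such as rule 110, which is known to be computationally universal. The minimal sketch below (my own implementation, with arbitrary width and step count) simply simulates it step by step; if the irreducibility thesis is right, this is essentially the only general way to find out what such a system will do.

```python
# Minimal elementary cellular automaton (rule 110), a simple system that is
# nonetheless computationally universal. Its behavior is found by simulating
# it update by update.

RULE = 110    # rule number encodes the new cell state for each 3-cell neighborhood
WIDTH = 64
STEPS = 32

def step(cells, rule=RULE):
    """Apply one update with wrap-around boundaries."""
    n = len(cells)
    return [
        (rule >> (cells[(i - 1) % n] * 4 + cells[i] * 2 + cells[(i + 1) % n])) & 1
        for i in range(n)
    ]

if __name__ == "__main__":
    cells = [0] * WIDTH
    cells[WIDTH // 2] = 1   # start from a single "on" cell
    for _ in range(STEPS):
        print("".join("#" if c else "." for c in cells))
        cells = step(cells)
```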

So, self-improvement can give AIs advantages and ultimately the upper hand in a conflict, but if done too hastily, it can also lead to goal drift. Now, consider the situation in which multiple AIs compete in a head-to-head race. Based on the above considerations this case becomes very similar to the AI arms races between groups of humans. Every single AI has incentives to take risks to increase its probability of winning, but overall this can lead to unintended outcomes with near certainty. There are reasons to assume that this self-improvement race dynamic will be more of a problem for AIs than it is for human factions. The goals of different AIs could diverge much more strongly than the goals of different humans. Whereas human factions may prefer the enemy’s win over a takeover by an uncontrolled AI, an AI with human values confronting an AI with strange values has less to lose from risky self-modifications. (There are some counter-considerations as well. For instance, AIs may be better at communicating and negotiating compromises.)

Thus, a self-improvement race between AIs seems to share the bad aspects of AI arms races between countries. This has a few implications:

  • Finding out (whether there is) a way for AIs to cooperate and prevent self-improvement races and other uncooperative outcomes becomes more important.
  • Usually, one argument for creating AI and colonizing space is that Earth-originating aligned AI could prevent other, less compassionate AIs (uncontrolled or created by uncompassionate ETs) from colonizing space. So, according to this argument, even if you don’t value what humans or human-controlled AIs would do in space, you should still choose it as the lesser of two (or more) evils. However, the problem of self-improvement races puts this argument into question.
  • On a similar note, making the universe more crowded with AIs, especially ones with weird (evolutionarily uncommon) values or ones that are not able to cooperate, may be harmful as it could lead to results that are bad for everyone (except for the AI which is created in a self-modification gone wrong).

Acknowledgment: This work was funded by the Foundational Research Institute (now the Center on Long-Term Risk).