Superintelligence: Paths, Dangers, Strategies (35 page)

Read Superintelligence: Paths, Dangers, Strategies Online

Authors: Nick Bostrom

Tags: #Science, #Philosophy, #Non-Fiction

In some cases, the mere ability to
detect
treaty violations is sufficient to establish the confidence needed for a deal. In other cases, however, there is a need for some mechanism to
enforce
compliance or mete out punishment if a violation
should occur. The need for an enforcement mechanism may arise if the threat of the wronged party withdrawing from the treaty is not enough to deter violations, for instance if the treaty-violator would gain such an advantage that he would not subsequently care how the other party responds.

If highly effective motivation selection methods are available, this enforcement problem could be solved by empowering an independent agency with sufficient police or military strength to enforce the treaty even against the opposition of one or several of its signatories. This solution requires that the enforcement agency can be trusted. But with sufficiently good motivation selection techniques, the requisite confidence might be achieved by having all the parties to the treaty jointly oversee the design of the enforcement agency.

Handing over power to an external enforcement agency raises many of the same issues that we confronted earlier in our discussions of a unipolar outcome (one in which a singleton arises prior to or during the initial machine intelligence revolution). In order to be able to enforce treaties concerning the vital security interests of rival states, the external enforcement agency would in effect need to constitute a singleton: a global superintelligent Leviathan. One difference, however, is that we are now considering a post-transition situation, in which the agents that would have to create this Leviathan would have greater competence than we humans currently do. These Leviathan-creators may themselves already be superintelligent. This would greatly improve the odds that they could solve the control problem and design an enforcement agency that would serve the interests of all the parties that have a say in its construction.

Aside from the costs of monitoring and enforcing compliance, are there any other obstacles to global coordination? Perhaps the major remaining issue is what we can refer to as
bargaining costs
.
41
Even when there is a possible bargain that would benefit everybody involved, it sometimes does not get off the ground because the parties fail to agree on how to divide the spoils. For example, if two persons could make a deal that would net them a dollar in profit, but each party feels she deserves sixty cents and refuses to settle for less, the deal will not happen and the potential gain will be forfeited. In general, negotiations can be difficult or protracted, or remain altogether barren, because of strategic bargaining choices made by some of the parties.

In real life, human beings frequently succeed in reaching agreements despite the possibility for strategic bargaining (though often not without considerable expenditure of time and patience). It is conceivable, however, that strategic bargaining problems would have a different dynamic in the post-transition era. An AI negotiator might more consistently adhere to some particular formal conception of rationality, possibly with novel or unanticipated consequences when matched with other AI negotiators. An AI might also have available to it moves in the bargaining game that are either unavailable to humans or very much more difficult for humans to execute, including the ability to precommit to a policy or a course of action. While humans (and human-run institutions) are occasionally able to precommit—with imperfect degrees of credibility and specificity—some
types of machine intelligence might be able to make arbitrary unbreakable precommitments and to allow negotiating partners to confirm that such a precommitment has been made.
42

The availability of powerful precommitment techniques could profoundly alter the nature of negotiations, potentially giving an immense edge to an agent that has a first-mover advantage. If a particular agent’s participation is necessary for the realization of some prospective gains from cooperation, and if that agent is able to make the first move, it would be in a position to dictate the division of the spoils by precommitting not to accept any deal that gives it less than, say, 99% of the surplus value. Other agents would then be faced with the choice of either getting nothing (by rejecting the unfair proposal) or getting 1% of the value (by caving in). If the first-moving agent’s precommitment is publicly verifiable, its negotiating partners could be sure that these are their only two options.

To avoid being exploited in this manner, agents might precommit to refuse blackmail and to decline all unfair offers. Once such a precommitment has been made (and successfully publicized), other agents would not find it in their interest to make threats or to precommit themselves to only accepting deals tilted in their own favor, because they would know that threats would fail and that unfair proposals would be rejected. But this just demonstrates again that the advantage is with the first-mover. The agent who moves first can choose whether to parlay its position of strength only to deter others from taking unfair advantage, or to make a grab for the lion’s share of future spoils.

Best situated of all, it might seem, would be the agent who starts out with a temperament or a value system that makes him impervious to extortion or indeed to any offer of a deal in which his participation is indispensable but he is not getting almost all of the gains. Some humans seem already to possess personality traits corresponding to various aspects of an uncompromising spirit.
43
A high-strung disposition, however, could backfire should it turn out that there are other agents around who feel entitled to more than their fair share and are committed to not backing down. The unstoppable force would then encounter the unmovable object, resulting in a failure to reach agreement (or worse: total war). The meek and the akratic would at least get something, albeit less than their fair share.

What kind of game-theoretic equilibrium would be reached in such a post-transition bargaining game is not immediately obvious. Agents might choose more complicated strategies than the ones considered here. One
hopes
that an equilibrium would be reached centered on some fairness norm that would serve as a Schelling point—a salient feature in a big outcome space which, because of shared expectations, becomes a likely coordination point in an otherwise underdetermined coordination game. Such an equilibrium might be bolstered by some of our evolved dispositions and cultural programming: a common preference for fairness could, assuming we succeed in transferring our values into the post-transition era, bias expectations and strategies in ways that lead to an attractive equilibrium.
44

In any case, the upshot is that with the possibility of strong and flexible forms of precommitment, outcomes of negotiations might take on an unfamiliar guise. Even if the post-transition era started out multipolar, it might be that a singleton would arise almost immediately as a consequence of a negotiated treaty that resolves all important global coordination problems. Some transaction costs, perhaps including monitoring and enforcement costs, might plummet with the new technological capabilities available to advanced machine intelligences. Other costs, in particular costs related to strategic bargaining, might remain significant. But however strategic bargaining affects the nature of the agreement that is reached, there is no clear reason why it would long delay the reaching of some agreement if an agreement were ever to be reached. If no agreement is reached, then some form of fighting might take place; and either one faction might win, and form a singleton around the winning coalition, or the result might be an interminable conflict, in which case a singleton may never form and the overall outcome may fall terribly short of what could and should have been achieved if humanity and its descendants had acted in a more coordinated and cooperative fashion.

 

We have seen that multipolarity, even if it could be achieved in a stable form, would not guarantee an attractive outcome. The original principal–agent problem remains unsolved, and burying it under a new set of problems related to post-transition global coordination failures may only make the situation worse. Let us therefore return to the question of how we could safely keep a single superintelligent AI.

CHAPTER 12
Acquiring values
 

Capability control is, at best, a temporary and auxiliary measure. Unless the plan is to keep superintelligence bottled up forever, it will be necessary to master motivation selection. But just how could we get some value into an artificial agent, so as to make it pursue that value as its final goal? While the agent is unintelligent, it might lack the capability to understand or even represent any humanly meaningful value. Yet if we delay the procedure until the agent is superintelligent, it may be able to resist our attempt to meddle with its motivation system—and, as we showed in
Chapter 7
, it would have convergent instrumental reasons to do so. This value-loading problem is tough, but must be confronted
.

The value-loading problem
 

It is impossible to enumerate all possible situations a superintelligence might find itself in and to specify for each what action it should take. Similarly, it is impossible to create a list of all possible worlds and assign each of them a value. In any realm significantly more complicated than a game of tic-tac-toe, there are far too many possible states (and state-histories) for exhaustive enumeration to be feasible. A motivation system, therefore, cannot be specified as a comprehensive lookup table. It must instead be expressed more abstractly, as a formula or rule that allows the agent to decide what to do in any given situation.

One formal way of specifying such a decision rule is via a utility function. A utility function (as we recall from
Chapter 1
) assigns value to each outcome that might obtain, or more generally to each “possible world.” Given a utility function, one can define an agent that maximizes expected utility. Such an agent selects at each time the action that has the highest expected utility. (The expected utility is calculated by weighting the utility of each possible world with the subjective probability of that world being the actual world conditional on a particular action being taken.) In reality, the possible outcomes are too numerous for the expected
utility of an action to be calculated exactly. Nevertheless, the decision rule and the utility function together determine a normative ideal—an optimality notion—that an agent might be designed to approximate; and the approximation might get closer as the agent gets more intelligent.
1
Creating a machine that can compute a good approximation of the expected utility of the actions available to it is an AI-complete problem.
2
This chapter addresses another problem, a problem that remains even if the problem of making machines intelligent is solved.

We can use this framework of a utility-maximizing agent to consider the predicament of a future seed-AI programmer who intends to solve the control problem by endowing the AI with a final goal that corresponds to some plausible human notion of a worthwhile outcome. The programmer has some particular human value in mind that he would like the AI to promote. To be concrete, let us say that it is happiness. (Similar issues would arise if we the programmer were interested in justice, freedom, glory, human rights, democracy, ecological balance, or self-development.) In terms of the expected utility framework, the programmer is thus looking for a utility function that assigns utility to possible worlds in proportion to the amount of happiness they contain. But how could he express such a utility function in computer code? Computer languages do not contain terms such as “happiness” as primitives. If such a term is to be used, it must first be defined. It is not enough to define it in terms of other high-level human concepts—“happiness is enjoyment of the potentialities inherent in our human nature” or some such philosophical paraphrase. The definition must bottom out in terms that appear in the AI’s programming language, and ultimately in primitives such as mathematical operators and addresses pointing to the contents of individual memory registers. When one considers the problem from this perspective, one can begin to appreciate the difficulty of the programmer’s task.

Identifying and codifying our own final goals is difficult because human goal representations are complex. Because the complexity is largely transparent to us, however, we often fail to appreciate that it is there. We can compare the case to visual perception. Vision, likewise, might seem like a simple thing, because we do it effortlessly.
3
We only need to open our eyes, so it seems, and a rich, meaningful, eidetic, three-dimensional view of the surrounding environment comes flooding into our minds. This intuitive understanding of vision is like a duke’s understanding of his patriarchal household: as far as he is concerned, things simply appear at their appropriate times and places, while the mechanism that produces those manifestations are hidden from view. Yet accomplishing even the simplest visual task—finding the pepper jar in the kitchen—requires a tremendous amount of computational work. From a noisy time series of two-dimensional patterns of nerve firings, originating in the retina and conveyed to the brain via the optic nerve, the visual cortex must work backwards to reconstruct an interpreted three-dimensional representation of external space. A sizeable portion of our precious one square meter of cortical real estate is zoned for processing visual information, and as you are reading this book, billions of neurons are working ceaselessly to accomplish this task (like so many seamstresses, bent
over their sewing machines in a sweatshop, sewing and re-sewing a giant quilt many times a second). In like manner, our seemingly simple values and wishes in fact contain immense complexity.
4
How could our programmer transfer this complexity into a utility function?

One approach would be to try to directly code a complete representation of whatever goal we have that we want the AI to pursue; in other words, to write out an explicit utility function. This approach might work if we had extraordinarily simple goals, for example if we wanted to calculate the digits of pi—that is, if the
only
thing we wanted was for the AI to calculate the digits of pi and we were indifferent to any other consequence that would result from the pursuit of this goal—recall our earlier discussion of the failure mode of infrastructure profusion. This explicit coding approach might also have some promise in the use of domesticity motivation selection methods. But if one seeks to promote or protect any plausible
human
value, and one is building a system intended to become a superintelligent sovereign, then explicitly coding the requisite complete goal representation appears to be hopelessly out of reach.
5

If we cannot transfer human values into an AI by typing out full-blown representations in computer code, what else might we try? This chapter discusses several alternative paths. Some of these may look plausible at first sight—but much less so upon closer examination. Future explorations should focus on those paths that remain open.

Solving the value-loading problem is a research challenge worthy of some of the next generation’s best mathematical talent. We cannot postpone confronting this problem until the AI has developed enough reason to easily understand our intentions. As we saw in the section on convergent instrumental reasons, a generic system will resist attempts to alter its final values. If an agent is not already fundamentally friendly by the time it gains the ability to reflect on its own agency, it will not take kindly to a belated attempt at brainwashing or a plot to replace it with a different agent that better loves its neighbor.

Evolutionary selection
 

Evolution has produced an organism with human values at least once. This fact might encourage the belief that evolutionary methods are the way to solve the value-loading problem. There are, however, severe obstacles to achieving safety along this path. We have already pointed to these obstacles at the end of
Chapter 10
when we discussed how powerful search processes can be dangerous.

Evolution can be viewed as a particular class of search algorithms that involve the alternation of two steps, one expanding a population of solution candidates by generating new candidates according to some relatively simple stochastic rule (such as random mutation or sexual recombination), the other contracting the population by pruning candidates that score poorly when tested by an evaluation function. As with many other types of powerful search, there is the risk that
the process will find a solution that satisfies the formally specified search criteria but not our implicit expectations. (This would hold whether one seeks to evolve a digital mind that has the same goals and values as a typical human being, or instead a mind that is, for instance, perfectly moral or perfectly obedient.) The risk would be avoided if we could specify a formal search criterion that accurately represented all dimensions of our goals, rather than just one aspect of what we think we desire. But this is precisely the value-loading problem, and it would of course beg the question in this context to assume that problem solved.

There is a further problem:

The total amount of suffering per year in the natural world is beyond all decent contemplation. During the minute that it takes me to compose this sentence, thousands of animals are being eaten alive, others are running for their lives, whimpering with fear, others are being slowly devoured from within by rasping parasites, thousands of all kinds are dying of starvation, thirst and disease.
6

 

Even just within our species, 150,000 persons are destroyed each day while countless more suffer an appalling array of torments and deprivations.
7
Nature might be a great experimentalist, but one who would never pass muster with an ethics review board—contravening the Helsinki Declaration and every norm of moral decency, left, right, and center. It is important that we not gratuitously replicate such horrors
in silico
. Mind crime seems especially difficult to avoid when evolutionary methods are used to produce human-like intelligence, at least if the process is meant to look anything like actual biological evolution.
8

Reinforcement learning
 

Reinforcement learning is an area of machine learning that studies techniques whereby agents can learn to maximize some notion of cumulative reward. By constructing an environment in which desired performance is rewarded, a reinforcement-learning agent can be made to learn to solve a wide class of problems (even in the absence of detailed instruction or feedback from the programmers, aside from the reward signal). Often, the learning algorithm involves the gradual construction of some kind of evaluation function, which assigns values to states, state–action pairs, or policies. (For instance, a program can learn to play backgammon by using reinforcement learning to incrementally improve its evaluation of possible board positions.) The evaluation function, which is continuously updated in light of experience, could be regarded as incorporating a form of learning about value. However, what is being learned is not new
final
values but increasingly accurate
estimates of the instrumental values
of reaching particular states (or of taking particular actions in particular states, or of following particular policies). Insofar as a reinforcement-learning agent can be described as having a final goal, that goal remains constant: to maximize future reward. And reward consists of specially designated percepts received from the environment.
Therefore, the wireheading syndrome remains a likely outcome in any reinforcement agent that develops a world model sophisticated enough to suggest this alternative way of maximizing reward.
9

These remarks do not imply that reinforcement-learning methods could never be used in a safe seed AI, only that they would have to be subordinated to a motivation system that is not itself organized around the principle of reward maximization. That, however, would require that a solution to the value-loading problem had been found by some other means than reinforcement learning.

Associative value accretion
 

Now one might wonder: if the value-loading problem is so tricky, how do we ourselves manage to acquire our values?

One possible (oversimplified) model might look something like this. We begin life with some relatively simple starting preferences (e.g. an aversion to noxious stimuli) together with a set of dispositions to acquire additional preferences in response to various possible experiences (e.g. we might be disposed to form a preference for objects and behaviors that we find to be valued and rewarded in our culture). Both the simple starting preferences and the dispositions are innate, having been shaped by natural and sexual selection over evolutionary timescales. Yet which preferences we end up with as adults depends on life events. Much of the information content in our final values is thus acquired from our experiences rather than preloaded in our genomes.

For example, many of us love another person and thus place great final value on his or her well-being. What is required to represent such a value? Many elements are involved, but consider just two: a representation of “person” and a representation of “well-being.” These concepts are not directly coded in our DNA. Rather, the DNA contains instructions for building a brain, which, when placed in a typical human environment, will over the course of several years develop a world model that includes concepts of persons and of well-being. Once formed, these concepts can be used to represent certain meaningful values. But some mechanism needs to be innately present that leads to values being formed around
these
concepts, rather than around other acquired concepts (like that of a flowerpot or a corkscrew).

The details of how this mechanism works are not well understood. In humans, the mechanism is probably complex and multifarious. It is easier to understand the phenomenon if we consider it in a more rudimentary form, such as filial imprinting in nidifugous birds, where the newly hatched chick acquires a desire for physical proximity to an object that presents a suitable moving stimulus within the first day after hatching. Which particular object the chick desires to be near depends on its experience; only the general disposition to imprint in this way is genetically determined. Analogously, Harry might place a final value on Sally’s well-being; but had the twain never met, he might have fallen in love with
somebody else instead, and his final values would have been different. The ability of our genes to code for the construction of a goal-acquiring mechanism explains how we come to have final goals of great informational complexity, greater than could be contained in the genome itself.

Other books

My Secret Guide to Paris by Lisa Schroeder
Safe from Harm (9781101619629) by Evans, Stephanie Jaye
A Murderous Glaze by Melissa Glazer
Killing Hitler by Roger Moorhouse
Return to Dust by Andrew Lanh
Double Dutch by Sharon M. Draper
Born Evil by Kimberley Chambers