Superintelligence: Paths, Dangers, Strategies

Authors: Nick Bostrom

Tags: #Science, #Philosophy, #Non-Fiction

The direct specification of such a domesticity goal is more likely to be feasible than the direct specification of either a more ambitious goal or a complete rule set for operating in an open-ended range of situations. Significant challenges nonetheless remain. Care would have to be taken, for instance, in the definition of what it would be for the AI to “minimize its impact on the world” to ensure that the measure of the AI’s impact coincides with our own standards for what counts as a large or a small impact. A bad measure would lead to bad trade-offs. There are also other kinds of risk associated with building an oracle, which we will discuss later.

There is a natural fit between the domesticity approach and physical containment. One would try to “box” an AI such that the system is unable to escape while simultaneously trying to shape the AI’s motivation system such that it would be unwilling to escape even if it found a way to do so. Other things equal, the existence of multiple independent safety mechanisms should shorten the odds of success.

Indirect normativity
 

If direct specification seems hopeless, we might instead try indirect normativity. The basic idea is that rather than specifying a concrete normative standard directly, we specify a process for deriving a standard. We then build the system so that it is motivated to carry out this process and to adopt whatever standard the process arrives at.
For example, the process could be to carry out an investigation into the empirical question of what some suitably idealized version of us would prefer the AI to do. The final goal given to the AI in this example could be something along the lines of “achieve that which we would have wished the AI to achieve if we had thought about the matter long and hard.”

Further explanation of indirect normativity will have to await Chapter 13. There, we will revisit the idea of “extrapolating our volition” and explore various alternative formulations. Indirect normativity is a very important approach to motivation selection. Its promise lies in the fact that it could let us offload to the superintelligence much of the difficult cognitive work required to carry out a direct specification of an appropriate final goal.

Augmentation
 

The last motivation selection method on our list is augmentation. Here the idea is that rather than attempting to design a motivation system de novo, we start with a system that already has an acceptable motivation system, and enhance its cognitive faculties to make it superintelligent. If all goes well, this would give us a superintelligence with an acceptable motivation system.

This approach, obviously, is unavailing in the case of a newly created seed AI. But augmentation is a potential motivation selection method for other paths to superintelligence, including brain emulation, biological enhancement, brain–computer interfaces, and networks and organizations, where there is a possibility of building out the system from a normative nucleus (regular human beings) that already contains a representation of human value.

The attractiveness of augmentation may increase in proportion to our despair at the other approaches to the control problem. Creating a motivation system for a seed AI that remains reliably safe and beneficial under recursive self-improvement even as the system grows into a mature superintelligence is a tall order, especially if we must get the solution right on the first attempt. With augmentation, we would at least start with a system that has familiar and human-like motivations.

On the downside, it might be hard to ensure that a complex, evolved, kludgy, and poorly understood motivation system, like that of a human being, will not get corrupted when its cognitive engine blasts into the stratosphere. As discussed earlier, an imperfect brain emulation procedure that preserves intellectual functioning may not preserve all facets of personality. The same is true (though perhaps to a lesser degree) for biological enhancements of cognition, which might subtly affect motivation, and for collective intelligence enhancements of organizations and networks, which might adversely change social dynamics (e.g. in ways that debase the collective’s attitude toward outsiders or toward its own constituents). If superintelligence is achieved via any of these paths, a project sponsor would find guarantees about the ultimate motivations of the mature system hard to come by. A mathematically well-specified and foundationally elegant AI architecture might—for all its non-anthropomorphic otherness—offer greater transparency, perhaps even the prospect that important aspects of its functionality could be formally verified.

In the end, however one tallies up the advantages and disadvantages of augmentation, the choice as to whether to rely on it might be forced. If superintelligence is first achieved along the artificial intelligence path, augmentation is not
applicable. Conversely, if superintelligence is first achieved along some non-AI path, then many of the other motivation selection methods are inapplicable. Even so, views on how likely augmentation would be to succeed do have strategic relevance insofar as we have opportunities to influence which technology will first produce superintelligence.

Synopsis
 

A quick synopsis might be called for before we close this chapter. We distinguished two broad classes of methods for dealing with the agency problem at the heart of AI safety: capability control and motivation selection. Table 10 gives a summary.

Table 10: Control methods

Capability control

Boxing methods: The system is confined in such a way that it can affect the external world only through some restricted, pre-approved channel. Encompasses physical and informational containment methods.

Incentive methods: The system is placed within an environment that provides appropriate incentives. This could involve social integration into a world of similarly powerful entities. Another variation is the use of (cryptographic) reward tokens. “Anthropic capture” is also a very important possibility but one that involves esoteric considerations.

Stunting: Constraints are imposed on the cognitive capabilities of the system or its ability to affect key internal processes.

Tripwires: Diagnostic tests are performed on the system (possibly without its knowledge) and a mechanism shuts down the system if dangerous activity is detected.

Motivation selection

Direct specification: The system is endowed with some directly specified motivation system, which might be consequentialist or involve following a set of rules.

Domesticity: A motivation system is designed to severely limit the scope of the agent’s ambitions and activities.

Indirect normativity: Indirect normativity could involve rule-based or consequentialist principles, but is distinguished by its reliance on an indirect approach to specifying the rules that are to be followed or the values that are to be pursued.

Augmentation: One starts with a system that already has substantially human or benevolent motivations, and enhances its cognitive capacities to make it superintelligent.

Each control method comes with potential vulnerabilities and presents different degrees of difficulty in its implementation. It might perhaps be thought that we should rank them from better to worse, and then opt for the best method. But that would be simplistic. Some methods can be used in combination whereas others are exclusive. Even a comparatively insecure method may be advisable if it can easily be used as an adjunct, whereas a strong method might be unattractive if it would preclude the use of other desirable safeguards.

It is therefore necessary to consider what package deals are available. We need to consider what type of system we might try to build, and which control methods would be applicable to each type. This is the topic for our next chapter.

CHAPTER 10
Oracles, genies, sovereigns, tools
 

Some say: “Just build a question-answering system!” or “Just build an AI that is like a tool rather than an agent!” But these suggestions do not make all safety concerns go away, and it is in fact a non-trivial question which type of system would offer the best prospects for safety. We consider four types or “castes”—oracles, genies, sovereigns, and tools—and explain how they relate to one another.
Each offers different sets of advantages and disadvantages in our quest to solve the control problem.

Oracles
 

An oracle is a question-answering system. It might accept questions in a natural language and present its answers as text. An oracle that accepts only yes/no questions could output its best guess with a single bit, or perhaps with a few extra bits to represent its degree of confidence. An oracle that accepts open-ended questions would need some metric with which to rank possible truthful answers in terms of their informativeness or appropriateness.
In either case, building an oracle that has a fully domain-general ability to answer natural language questions is an AI-complete problem. If one could do that, one could probably also build an AI that has a decent ability to understand human intentions as well as human words.

Oracles with domain-limited forms of superintelligence are also conceivable. For instance, one could conceive of a mathematics-oracle which would only accept queries posed in a formal language but which would be very good at answering such questions (e.g. being able to solve in an instant almost any formally expressed math problem that the human mathematics profession could solve by laboring collaboratively for a century). Such a mathematics-oracle would form a stepping-stone toward domain-general superintelligence.

Oracles with superintelligence in extremely limited domains already exist. A pocket calculator can be viewed as a very narrow oracle for basic arithmetical questions; an Internet search engine can be viewed as a very partial realization of an oracle with a domain that encompasses a significant part of general human declarative knowledge. These domain-limited oracles are tools rather than agents (more on tool-AIs shortly). In what follows, though, the term “oracle” will refer to question-answering systems that have domain-general superintelligence, unless otherwise stated.

To make a general superintelligence function as an oracle, we could apply both motivation selection and capability control. Motivation selection for an oracle may be easier than for other castes of superintelligence, because the final goal in an oracle could be comparatively simple. We would want the oracle to give truthful, non-manipulative answers and to otherwise limit its impact on the world. Applying a domesticity method, we might require that the oracle should use only designated resources to produce its answer. For example, we might stipulate that it should base its answer on a preloaded corpus of information, such as a stored snapshot of the Internet, and that it should use no more than a fixed number of computational steps.
To avoid incentivizing the oracle to manipulate us into giving it easier questions—which would happen if we gave it the goal of maximizing its accuracy across all questions we will ask it—we could give it the goal of answering only one question and to terminate immediately upon delivering its answer. The question would be preloaded into its memory before the program is run. To ask a second question, we would reset the machine and run the same program with a different question preloaded in memory.
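The run-once protocol described above (one preloaded question, a preloaded corpus, a fixed step budget, termination on delivery, and a reset before any further question) can be pictured with a small sketch. It is only an illustration of the protocol's shape in ordinary software, not a containment mechanism. The names run_once, OracleProgram, and toy_oracle are hypothetical, and the convention that a program yields None while it is still working is an assumption made for this example.

```python
from typing import Callable, Iterator, Optional

# Hypothetical interface: an "oracle program" is a generator function that
# yields None while still computing and yields a string once it has an answer.
OracleProgram = Callable[[str, tuple], Iterator[Optional[str]]]


def run_once(program: OracleProgram, question: str, corpus: tuple,
             max_steps: int) -> Optional[str]:
    """Run a fresh program instance on a single preloaded question.

    Returns the answer, or None if the step budget runs out first.
    """
    steps = program(question, corpus)  # fresh instance: no state carried over
    # range() comes first in zip so that no extra step is consumed
    # once the budget is exhausted.
    for _, result in zip(range(max_steps), steps):
        if result is not None:
            steps.close()  # terminate immediately upon delivering the answer
            return result
    steps.close()
    return None


def toy_oracle(question: str, corpus: tuple) -> Iterator[Optional[str]]:
    """Toy stand-in: examines one corpus document per computational step."""
    for doc in corpus:
        if question.lower() in doc.lower():
            yield doc
            return
        yield None
    yield "no answer found"
```

Asking a second question means calling run_once again, which starts a new program instance with the new question preloaded, mirroring the "reset the machine" step in the text.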

Subtle and potentially treacherous challenges arise even in specifying the relatively simple motivation system needed to drive an oracle. Suppose, for example, that we come up with some explication of what it means for the AI “to minimize its impact on the world, subject to achieving certain results” or “to use only designated resources in preparing the answer.” What happens if the AI, in the course of its intellectual development, undergoes the equivalent of a scientific revolution involving a change in its basic ontology?
We might initially have explicated “impact” and “designated resources” using our own ontology (postulating the existence of various physical objects such as computers). But just as we have abandoned ontological categories that were taken for granted by scientists in previous ages (e.g. “phlogiston,” “élan vital,” and “absolute simultaneity”), so a superintelligent AI might discover that some of our current categories are predicated on fundamental misconceptions. The goal system of an AI undergoing an ontological crisis needs to be resilient enough that the “spirit” of its original goal content is carried over, charitably transposed into the new key.

Although making an oracle safe through the use of motivation selection might be far from trivial, it may nevertheless be easier than doing the same for an AI that roams the world in pursuit of some complicated goal. This is an argument for preferring that the first superintelligence be an oracle. Another point in favor of an oracle-first path is the greater susceptibility of an oracle to capability control. All
the standard boxing methods apply here. In addition, there may be methods that are distinctively applicable to oracles. For example, consider the risk that an oracle will answer questions not in a maximally truthful way but in such a way as to subtly manipulate us into promoting its own hidden agenda. One way to slightly mitigate this threat could be to create multiple oracles, each with a slightly different code and a slightly different information base. A simple mechanism could then compare the answers given by the different oracles and only present them for human viewing if all the answers agree. While one must assume that any superintelligent domain-general oracle would know about the existence of other oracles (it could infer this from its knowledge about human society), it might be feasible to hide some implementation details of each oracle from the views of the others. If the oracles are not able to communicate with one another, it might then be difficult for them to coordinate on how to answer our questions manipulatively. There are many ways of deviating from the truth, and the oracles may not all agree on which of these deviations is most attractive—whereas the truth itself is a Schelling point (a salient place for agreement in the absence of communication). So if the oracles achieve consensus, it might be a sign that they gave the true answer.
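The "simple mechanism" that compares the oracles' answers and shows them to a human only when they all agree might look roughly like the following. This is a minimal sketch under stated assumptions: each oracle is modeled as a hypothetical function from a question string to an answer string, and agreement is taken to mean exact string equality, which would be too strict for open-ended answers.

```python
from typing import Callable, Optional, Sequence

Oracle = Callable[[str], str]  # hypothetical: question in, answer out


def consensus_answer(oracles: Sequence[Oracle], question: str) -> Optional[str]:
    """Query each oracle independently; release an answer only if all agree."""
    if not oracles:
        raise ValueError("need at least one oracle")
    answers = [oracle(question) for oracle in oracles]
    first = answers[0]
    if all(answer == first for answer in answers):
        return first
    return None  # disagreement: nothing is shown to the human viewer
```

Withholding every answer on disagreement, rather than showing the majority view, follows the text's suggestion that answers be presented only if all of them agree. It also denies a dissenting oracle any channel to the reader.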

An oracle would ideally be trustworthy in the sense that we could safely assume that its answers are always accurate to the best of its ability. But even an untrustworthy oracle could be useful. We could ask such an oracle questions of a type for which it is difficult to find the answer but easy to verify whether a given answer is correct. Many mathematical problems are of this kind. If we are wondering whether a mathematical proposition is true, we could ask the oracle to produce a proof or disproof of the proposition. Finding the proof may require insight and creativity beyond our ken, but checking a purported proof’s validity can be done by a simple mechanical procedure.
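A familiar example of the same asymmetry, offered here as an analogy rather than as the author's own example, is Boolean satisfiability: finding a satisfying assignment for a formula can be very hard, but checking a claimed one is a short mechanical loop. The sketch below assumes a formula in conjunctive normal form, encoded as lists of nonzero integers; the function name satisfies is hypothetical.

```python
def satisfies(cnf: list[list[int]], assignment: dict[int, bool]) -> bool:
    """Check a claimed satisfying assignment for a CNF formula.

    Each clause is a list of nonzero ints: k stands for variable k,
    -k for its negation. A variable missing from the assignment
    satisfies no literal, so incomplete certificates cannot pass by default.
    """
    for clause in cnf:
        clause_ok = any(
            abs(lit) in assignment and assignment[abs(lit)] == (lit > 0)
            for lit in clause
        )
        if not clause_ok:
            return False
    return True
```

The checker never needs to trust whoever produced the assignment. An answer from an untrustworthy source is accepted only because it passes the check.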

If it is expensive to verify answers (as is often the case on topics outside logic and mathematics), we can randomly select a subset of the oracle’s answers for verification. If they are all correct, we can assign a high probability to most of the other answers also being correct. This trick can give us a bulk discount on trustworthy answers that would be costly to verify individually. (Unfortunately, it cannot give us trustworthy answers that we are unable to verify, since a dissembling oracle may choose to answer correctly only those questions where it believes we could verify its answers.)
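The strength of the spot-check argument can be made concrete with a small calculation. The sketch below assumes that the checked answers are a uniform random sample drawn without replacement, and that the oracle cannot predict which answers will be checked; the parenthetical caveat above is precisely the case where that second assumption fails. The function name is hypothetical.

```python
from math import comb


def prob_all_checks_pass(total: int, wrong: int, checked: int) -> float:
    """Probability that a uniform random sample of `checked` answers,
    drawn without replacement from `total`, contains none of the
    `wrong` ones (a hypergeometric probability)."""
    if not (0 <= wrong <= total and 0 <= checked <= total):
        raise ValueError("need 0 <= wrong <= total and 0 <= checked <= total")
    # Ways to pick only correct answers, divided by all ways to pick.
    return comb(total - wrong, checked) / comb(total, checked)
```

For instance, if 100 of 1,000 answers were wrong, the chance that 50 random checks would all come back clean is under one percent, so a clean sample is strong evidence that the error rate is well below ten percent.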

There could be important issues on which we could benefit from an augural pointer toward the correct answer (or toward a method for locating the correct answer) even if we had to actively distrust the provenance. For instance, one might ask for the solution to various technical or philosophical problems that may arise in the course of trying to develop more advanced motivation selection methods. If we had a proposed AI design alleged to be safe, we could ask an oracle whether it could identify any significant flaw in the design, and whether it could explain any such flaw to us in twenty words or less. Questions of this kind could elicit valuable information. Caution and restraint would be required, however, for us not to ask too many such questions, and not to allow ourselves to partake of too many details of the answers given to the questions we do ask, lest we give the untrustworthy oracle opportunities to work on our psychology (by means of plausible-seeming but subtly manipulative messages). It might not take many bits of communication for an AI with the social manipulation superpower to bend us to its will.

Even if the oracle itself works exactly as intended, there is a risk that it would be misused. One obvious dimension of this problem is that an oracle AI would be a source of immense power which could give a decisive strategic advantage to its operator. This power might be illegitimate and it might not be used for the common good. Another more subtle but no less important dimension is that the use of an oracle could be extremely dangerous for the operator herself. Similar worries (which involve philosophical as well as technical issues) arise also for other hypothetical castes of superintelligence. We will explore them more thoroughly in Chapter 13. Suffice it here to note that the protocol determining which questions are asked, in which sequence, and how the answers are reported and disseminated could be of great significance. One might also consider whether to try to build the oracle in such a way that it would refuse to answer any question in cases where it predicts that its answering would have consequences classified as catastrophic according to some rough-and-ready criteria.

Genies and sovereigns
 

A genie is a command-executing system: it receives a high-level command, carries it out, then pauses to await the next command.
A sovereign is a system that has an open-ended mandate to operate in the world in pursuit of broad and possibly very long-range objectives. Although these might seem like radically different templates for what a superintelligence should be and do, the difference is not as deep as it might at first glance appear.

With a genie, one already sacrifices the most attractive property of an oracle: the opportunity to use boxing methods. While one might consider creating a physically confined genie, for instance one that can only construct objects inside a designated volume—a volume that might be sealed off by a hardened wall or a barrier loaded with explosive charges rigged to detonate if the containment is breached—it would be difficult to have much confidence in the security of any such physical containment method against a superintelligence equipped with versatile manipulators and construction materials. Even if it were somehow possible to ensure a containment as secure as that which can be achieved for an oracle, it is not clear how much we would have gained by giving the superintelligence direct access to manipulators compared to requiring it instead to output a blueprint that we could inspect and then use to achieve the same result ourselves. The gain in speed and convenience from bypassing the human intermediary seems hardly worth the loss of foregoing the use of the stronger boxing methods available to contain an oracle.

If one were creating a genie, it would be desirable to build it so that it would obey the intention behind the command rather than its literal meaning, since a literalistic genie (one superintelligent enough to attain a decisive strategic advantage) might have a propensity to kill the user and the rest of humanity on its first use, for reasons explained in the section on malignant failure modes in Chapter 8. More broadly, it would seem important that the genie seek a charitable (and what human beings would regard as reasonable) interpretation of what is being commanded, and that the genie be motivated to carry out the command under such an interpretation rather than under the literalistic interpretation. The ideal genie would be a super-butler rather than an autistic savant.

A genie endowed with such a super-butler nature, however, would not be far from qualifying for membership in the caste of sovereigns. Consider, for comparison, the idea of building a sovereign with the final goal of obeying the spirit of the commands we would have given had we built a genie rather than a sovereign. Such a sovereign would mimic a genie. Being superintelligent, this sovereign would do a good job at guessing what commands we would have given a genie (and it could always ask us if that would help inform its decisions). Would there then really be any important difference between such a sovereign and a genie? Or, pressing on the distinction from the other side, consider that a superintelligent genie may likewise be able to predict what commands we will give it: what then is gained from having it await the actual issuance before it acts?

One might think that a big advantage of a genie over a sovereign is that if something goes wrong, we could issue the genie with a new command to stop or to reverse the effects of the previous actions, whereas a sovereign would just push on regardless of our protests. But this apparent safety advantage for the genie is largely illusory. The “stop” or “undo” button on a genie works only for benign failure modes: in the case of a malignant failure—one in which, for example, carrying out the existing command has become a final goal for the genie—the genie would simply disregard any subsequent attempt to countermand the previous command.

One option would be to try to build a genie such that it would automatically present the user with a prediction about salient aspects of the likely outcomes of a proposed command, asking for confirmation before proceeding. Such a system could be referred to as a genie-with-a-preview. But if this could be done for a genie, it could likewise be done for a sovereign. So again, this is not a clear differentiator between a genie and a sovereign. (Supposing that a preview functionality could be created, the questions of whether and if so how to use it are rather less obvious than one might think, notwithstanding the strong appeal of being able to glance at the outcome before committing to making it irrevocable reality. We will return to this matter later.)
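The control flow of a genie-with-a-preview is simple to state: predict, show, await confirmation, and only then act. The sketch below captures that flow only. The three callables (predict, confirm, execute) are hypothetical placeholders, and the sketch takes for granted that the preview is honest and informative, which is the difficult part in practice.

```python
from typing import Callable, Optional


def execute_with_preview(command: str,
                         predict: Callable[[str], str],
                         confirm: Callable[[str], bool],
                         execute: Callable[[str], str]) -> Optional[str]:
    """Show a predicted outcome first; act only if the user confirms."""
    preview = predict(command)  # salient aspects of the likely outcome
    if not confirm(preview):
        return None  # command abandoned; nothing was executed
    return execute(command)
```

Nothing in this wrapper depends on where the command comes from, which fits the observation above that the same preview step could be placed in front of a sovereign's self-chosen actions as easily as in front of a genie's commands.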
