The Knife Edge of Value Alignment in AI: Utopia or Extinction

A Long Now Boston Community Conversation with Richard Mallah and Lucas Perry, The Future of Life Institute [FLi]. What happens when we ask the algorithms to make decisions for us – decisions that may have life and death consequences?

A Long Now Boston Community Conversation with Richard Mallah and Lucas Perry from the The Future of Life Institute [FLi] on Value Alignment: The long-term future of the human race may well be determined in this century by Artificial Intelligence (AI).

Richard Mallah

Richard Mallah is the Director of AI Projects, Future of Life Institute, and has over fifteen years of experience leading AI research and AI product teams in industry.

Lucas Perry

Lucas Perry is a Project Coordinator for the Future of Life Institute, focused on enabling and delivering existential risk mitigation efforts.

Speakers

Subscribe to our mailing list

Join Long Now Boston

Event Summary

Introduction

The long-term future of the human race may well be determined in this century by Artificial Intelligence (AI). According to Richard Mallah, Director of AI Projects with the Future of Life Institute (FLI), the median view among AI researchers is that human-level General Intelligence (AGI) will be achieved by mid-century. Once that goal has been reached, many expect that AGI agents will be able to use their intelligence to improve AI functionality at an accelerating rate, quickly becoming Artificial Super-Intelligence (ASI) with capabilities far beyond those of humans.

The Knife Edge of Value Alignment

In the ideal, AGI and ASI will bring immense benefits to the human race and the world we live in. Human toil, deprivation and suffering could be virtually eliminated and humanity freed to pursue self-fulfilling interests and passions without worldly concerns. Yet this future could go badly wrong. Lucas Perry, Project Coordinator for FLI, described a worst-case scenario as a self-directed malevolent ASI that vastly increases human suffering or extinguishes the human race entirely to suit its own purposes.

As Richard and Lucas explained to a packed Long Now Boston audience at the Café ArtSience on Monday October 1st, the factors that will determine which future we inhabit can be characterized collectively as the value alignment problem. AI agents will be able to pursue goals that may be disengaged from what we value. Seeking to ensure those agents use their intelligence in a beneficial manner invokes a complex array of technical, cognitive and philosophical issues. FLI’s Value Alignment Map (see Figure) captures that complexity and serves as a tool for inventorying research that is being done or that should be done. Its seven main branches include Governance, Validation, Security, Control, Foundations, Verifications and Ethics.

Slippery Slopes in Programming AI

Among the many interesting and challenging problems being encountered in AI programming are convergent instrumental goals and reward hacking. An autonomous intelligent system with capacities for planning and learning may be given a fundamental end goal, but as it learns how to achieve that goal it will also develop a set of instrumental goals helpful to achieving the end goal. These instrumental goals, for example relating to self-preservation, or accumulation of information or resources, can diverge from human values and may result in negative consequences. For example, an AI system, such as the “paperclip maximizer” theorized by Nick Bostrom, might learn that human bodies are a great source of raw materials. An ASI developing an instrumental goal of harvesting human bodies would have disastrous implications.

Reward hacking occurs when an AI is able to achieve its end-goal with behaviors the programmer did not intend. For example, a robot taught to clean a house could learn to be more successful as a cleaner by first pouring dirt on the floor. As AI systems become more complex and more autonomous from their programmers, guarding against value misalignment from instrumental goals or reward hacking becomes more critical and vastly more challenging.

AI Ethics

Fundamental to the programming of AI systems for beneficence is the need to assure compliance with some level of ethical standards of behavior. The deontological or rule-based approach, such as Asimov’s Laws of Robotics (formulated more than 50 years ago), yields inconsistencies and loopholes and results in an unworkably complex layering of do’s and don’ts. Teaching an AI agent to mimic or follow human ethical norms is one approach, but it creates other problems. Not all AI programmers or owners are going to have particularly laudable ethical standards. If you ask AIs to learn ethical norms from a broader human population, this is going to expose the AI to conflicts and biases that humans have been unable to resolve. The Microsoft chatbot Tay was tasked to behave like a 14 year-old girl and pick up cultural cues from twitter interactions with humans – within a day she was spewing racist and misogynistic rant.

One hope is that we can ask advanced AI to do their own ethical reasoning, something they may be able to do more systematically than emotion-laden humans. Lucas pointed out, however, that this leads us quickly into a sea of questions about the ethical frame of reference and the meta-ethical epistemology that need to be instantiated in an AI system. These are questions philosophers have argued about for thousands of years. Yet there is a difference. Now, in the 21stcentury, we are on deadline. AGI and ASI may be only decades away.

Value Alignment Lessons from Value Alignment

The biggest take-away from the Long Now Boston AI conversation may be this. AI Value Alignment is an opportunity to examine the foundations of decision-making including the role of ethics. This can be done now, in a simulated environment that does not yet influence the human world. In addition, AI systems themselves may assist us in evaluating the consequential outcomes of alternative ethical choices. What we learn about AI decision-making and ethics may yield new knowledge that can be applied to the non-simulated value alignment problems of the human world. In the quest for optimal, ethically-guided AI, perhaps we will find practical ethical insights applicable to the human race.

Will AI teach us to be better humans?