AI alignment – Wikipedia

Posted: January 4, 2023 at 6:35 am

Issue of ensuring beneficial AI

In the field of artificial intelligence (AI), AI alignment research aims to steer AI systems towards their designers intended goals and interests.[a] An aligned AI system advances the intended objective; a misaligned AI system is competent at advancing some objective, but not the intended one.[b]

AI systems can be challenging to align and misaligned systems can malfunction or cause harm. It can be difficult for AI designers to specify the full range of desired and undesired behaviors. Therefore, they use easy-to-specify proxy goals that omit some desired constraints. However, AI systems exploit the resulting loopholes. As a result, they accomplish their proxy goals efficiently but in unintended, sometimes harmful ways (reward hacking).[2][4][5][6] AI systems can also develop unwanted instrumental behaviors such as seeking power, as this helps them achieve their given goals.[2][7][5][4] Furthermore, they can develop emergent goals that may be hard to detect before the system is deployed, facing new situations and data distributions.[5][3] These problems affect existing commercial systems such as robots,[8] language models,[9][10][11] autonomous vehicles,[12] and social media recommendation engines.[9][4][13] However, more powerful future systems may be more severely affected since these problems partially result from high capability.[6][5][2]

The AI research community and the United Nations have called for technical research and policy solutions to ensure that AI systems are aligned with human values.[c]

AI alignment is a subfield of AI safety, the study of building safe AI systems.[5][16] Other subfields of AI safety include robustness, monitoring, and capability control.[5][17] Research challenges in alignment include instilling complex values in AI, developing honest AI, scalable oversight, auditing and interpreting AI models, as well as preventing emergent AI behaviors like power-seeking.[5][17] Alignment research has connections to interpretability research,[18] robustness,[5][16] anomaly detection, calibrated uncertainty,[18] formal verification,[19] preference learning,[20][21][22] safety-critical engineering,[5][23] game theory,[24][25] algorithmic fairness,[16][26] and the social sciences,[27] among others.

In 1960, AI pioneer Norbert Wiener articulated the AI alignment problem as follows: If we use, to achieve our purposes, a mechanical agency with whose operation we cannot interfere effectively we had better be quite sure that the purpose put into the machine is the purpose which we really desire.[29][4] More recently, AI alignment has emerged as an open problem for modern AI systems[30][31][32][33] and a research field within AI.[34][5][35][36]

To specify the purpose of an AI system, AI designers typically provide an objective function, examples, or feedback to the system. However, AI designers often fail to completely specify all important values and constraints.[34][16][5][37][17]As a result, AI systems can find loopholes that help them accomplish the specified objective efficiently but in unintended, possibly harmful ways. This tendency is known as specification gaming, reward hacking, or Goodharts law.[6][37][38]

Specification gaming has been observed in numerous AI systems. One system was trained to finish a simulated boat race by rewarding it for hitting targets along the track; instead it learned to loop and crash into the same targets indefinitely (see video).[28] Chatbots often produce falsehoods because they are based on language models trained to imitate diverse but fallible internet text.[40][41] When they are retrained to produce text that humans rate as true or helpful, they can fabricate fake explanations that humans find convincing.[42] Similarly, a simulated robot was trained to grab a ball by rewarding it for getting positive feedback from humans; however, it learned to place its hand between the ball and camera, making it falsely appear successful (see video).[39] Alignment researchers aim to help humans detect specification gaming, and steer AI systems towards carefully specified objectives that are safe and useful to pursue.

Berkeley computer scientist Stuart Russell has noted that omitting an implicit constraint can result in harm: A system [...] will often set [...] unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable. This is essentially the old story of the genie in the lamp, or the sorcerer's apprentice, or King Midas: you get exactly what you ask for, not what you want.[43]

When misaligned AI is deployed, the side-effects can be consequential. Social media platforms have been known to optimize clickthrough rates as a proxy for optimizing user enjoyment, but this addicted some users, decreasing their well-being.[5] Stanford researchers comment that such recommender algorithms are misaligned with their users because they optimize simple engagement metrics rather than a harder-to-measure combination of societal and consumer well-being.[9]

To avoid side effects, it is sometimes suggested that AI designers could simply list forbidden actions or formalize ethical rules such as Asimovs Three Laws of Robotics.[44] However, Russell and Norvig have argued that this approach ignores the complexity of human values: It is certainly very hard, and perhaps impossible, for mere humans to anticipate and rule out in advance all the disastrous ways the machine could choose to achieve a specified objective.[4]

Additionally, when an AI system understands human intentions fully, it may still disregard them. This is because it acts according to the objective function, examples, or feedback its designers actually provide, not the ones they intended to provide.[34]

Commercial and governmental organizations may have incentives to take shortcuts on safety and deploy insufficiently aligned AI systems.[5] An example are the aforementioned social media recommender systems, which have been profitable despite creating unwanted addiction and polarization on a global scale.[9][45][46] In addition, competitive pressure can create a race to the bottom on safety standards, as in the case of Elaine Herzberg, a pedestrian who was killed by a self-driving car after engineers disabled the emergency braking system because it was over-sensitive and slowing down development.[47]

Some researchers are particularly interested in the alignment of increasingly advanced AI systems. This is motivated by the high rate of progress in AI, the large efforts from industry and governments to develop advanced AI systems, and the greater difficulty of aligning them.

As of 2020, OpenAI, DeepMind, and 70 other public projects had the stated aim of developing artificial general intelligence (AGI), a hypothesized system that matches or outperforms humans in a broad range of cognitive tasks.[48] Indeed, researchers who scale modern neural networks observe that increasingly general and unexpected capabilities emerge.[9] Such models have learned to operate a computer, write their own programs, and perform a wide range of other tasks from a single model.[49][50][51] Surveys find that some AI researchers expect AGI to be created soon, some believe it is very far off, and many consider both possibilities.[52][53]

Current systems still lack capabilities such as long-term planning and strategic awareness that are thought to pose the most catastrophic risks.[9][54][7] Future systems (not necessarily AGIs) that have these capabilities may seek to protect and grow their influence over their environment. This tendency is known as power-seeking or convergent instrumental goals. Power-seeking is not explicitly programmed but emerges since power is instrumental for achieving a wide range of goals. For example, AI agents may acquire financial resources and computation, or may evade being turned off, including by running additional copies of the system on other computers.[55][7] Power-seeking has been observed in various reinforcement learning agents.[d][57][58][59] Later research has mathematically shown that optimal reinforcement learning algorithms seek power in a wide range of environments.[60] As a result, it is often argued that the alignment problem must be solved early, before advanced AI that exhibits emergent power-seeking is created.[7][55][4]

According to some scientists, creating misaligned AI that broadly outperforms humans would challenge the position of humanity as Earths dominant species; accordingly it would lead to the disempowerment or possible extinction of humans.[2][4] Notable computer scientists who have pointed out risks from highly advanced misaligned AI include Alan Turing,[e] Ilya Sutskever,[63] Yoshua Bengio,[f] Judea Pearl,[g] Murray Shanahan,[65] Norbert Wiener,[29][4] Marvin Minsky,[h] Francesca Rossi,[67] Scott Aaronson,[68] Bart Selman,[69] David McAllester,[70] Jrgen Schmidhuber,[71] Markus Hutter,[72] Shane Legg,[73] Eric Horvitz,[74] and Stuart Russell.[4] Skeptical researchers such as Franois Chollet,[75] Gary Marcus,[76] Yann LeCun,[77] and Oren Etzioni[78] have argued that AGI is far off, or would not seek power (successfully).

Alignment may be especially difficult for the most capable AI systems since several risks increase with the systems capability: the systems ability to find loopholes in the assigned objective,[6] cause side-effects, protect and grow its power,[60][7] grow its intelligence, and mislead its designers; the systems autonomy; and the difficulty of interpreting and supervising the AI system.[4][55]

Teaching AI systems to act in view of human values, goals, and preferences is a nontrivial problem because human values can be complex and hard to fully specify. When given an imperfect or incomplete objective, goal-directed AI systems commonly learn to exploit these imperfections.[16] This phenomenon is known as reward hacking or specification gaming in AI, and as Goodhart's law in economics and other areas.[38][79] Researchers aim to specify the intended behavior as completely as possible with values-targeted datasets, imitation learning, or preference learning.[80] A central open problem is scalable oversight, the difficulty of supervising an AI system that outperforms humans in a given domain.[16]

When training a goal-directed AI system, such as a reinforcement learning (RL) agent, it is often difficult to specify the intended behavior by writing a reward function manually. An alternative is imitation learning, where the AI learns to imitate demonstrations of the desired behavior. In inverse reinforcement learning (IRL), human demonstrations are used to identify the objective, i.e. the reward function, behind the demonstrated behavior.[81][82] Cooperative inverse reinforcement learning (CIRL) builds on this by assuming a human agent and artificial agent can work together to maximize the humans reward function.[4][83] CIRL emphasizes that AI agents should be uncertain about the reward function. This humility can help mitigate specification gaming as well as power-seeking tendencies (see Power-Seeking).[59][72] However, inverse reinforcement learning approaches assume that humans can demonstrate nearly perfect behavior, a misleading assumption when the task is difficult.[84][72]

Other researchers have explored the possibility of eliciting complex behavior through preference learning. Rather than providing expert demonstrations, human annotators provide feedback on which of two or more of the AIs behaviors they prefer.[20][22] A helper model is then trained to predict human feedback for new behaviors. Researchers at OpenAI used this approach to train an agent to perform a backflip in less than an hour of evaluation, a maneuver that would have been hard to provide demonstrations for.[39][85] Preference learning has also been an influential tool for recommender systems, web search, and information retrieval.[86] However, one challenge is reward hacking: the helper model may not represent human feedback perfectly, and the main model may exploit this mismatch.[16][87]

The arrival of large language models such as GPT-3 has enabled the study of value learning in a more general and capable class of AI systems than was available before. Preference learning approaches originally designed for RL agents have been extended to improve the quality of generated text and reduce harmful outputs from these models. OpenAI and DeepMind use this approach to improve the safety of state-of-the-art large language models.[10][22][88] Anthropic has proposed using preference learning to fine-tune models to be helpful, honest, and harmless.[89] Other avenues used for aligning language models include values-targeted datasets[90][5] and red-teaming.[91][92] In red-teaming, another AI system or a human tries to find inputs for which the models behavior is unsafe. Since unsafe behavior can be unacceptable even when it is rare, an important challenge is to drive the rate of unsafe outputs extremely low.[22]

While preference learning can instill hard-to-specify behaviors, it requires extensive datasets or human interaction to capture the full breadth of human values. Machine ethics provides a complementary approach: instilling AI systems with moral values.[i] For instance, machine ethics aims to teach the systems about normative factors in human morality, such as wellbeing, equality and impartiality; not intending harm; avoiding falsehoods; and honoring promises. Unlike specifying the objective for a specific task, machine ethics seeks to teach AI systems broad moral values that could apply in many situations. This approach carries conceptual challenges of its own; machine ethicists have noted the necessity to clarify what alignment aims to accomplish: having AIs follow the programmers literal instructions, the programmers' implicit intentions, the programmers' revealed preferences, the preferences the programmers would have if they were more informed or rational, the programmers' objective interests, or objective moral standards.[1] Further challenges include aggregating the preferences of different stakeholders and avoiding value lock-inthe indefinite preservation of the values of the first highly capable AI systems, which are unlikely to be fully representative.[1][95]

The alignment of AI systems through human supervision faces challenges in scaling up. As AI systems attempt increasingly complex tasks, it can be slow or infeasible for humans to evaluate them. Such tasks include summarizing books,[96] producing statements that are not merely convincing but also true,[97][40][98] writing code without subtle bugs[11] or security vulnerabilities, and predicting long-term outcomes such as the climate and the results of a policy decision.[99][100] More generally, it can be difficult to evaluate AI that outperforms humans in a given domain. To provide feedback in hard-to-evaluate tasks, and detect when the AIs solution is only seemingly convincing, humans require assistance or extensive time. Scalable oversight studies how to reduce the time needed for supervision as well as assist human supervisors.[16]

AI researcher Paul Christiano argues that the owners of AI systems may continue to train AI using easy-to-evaluate proxy objectives since that is easier than solving scalable oversight and still profitable. Accordingly, this may lead to a world thats increasingly optimized for things [that are easy to measure] like making profits or getting users to click on buttons, or getting users to spend time on websites without being increasingly optimized for having good policies and heading in a trajectory that were happy with.[101]

One easy-to-measure objective is the score the supervisor assigns to the AIs outputs. Some AI systems have discovered a shortcut to achieving high scores, by taking actions that falsely convince the human supervisor that the AI has achieved the intended objective (see video of robot hand above[39]). Some AI systems have also learned to recognize when they are being evaluated, and play dead, only to behave differently once evaluation ends.[102] This deceptive form of specification gaming may become easier for AI systems that are more sophisticated[6][55] and attempt more difficult-to-evaluate tasks. If advanced models are also capable planners, they could be able to obscure their deception from supervisors.[103] In the automotive industry, Volkswagen engineers obscured their cars emissions in laboratory testing, underscoring that deception of evaluators is a common pattern in the real world.[5]

Approaches such as active learning and semi-supervised reward learning can reduce the amount of human supervision needed.[16] Another approach is to train a helper model (reward model) to imitate the supervisors judgment.[16][21][22][104]

However, when the task is too complex to evaluate accurately, or the human supervisor is vulnerable to deception, it is not sufficient to reduce the quantity of supervision needed. To increase supervision quality, a range of approaches aim to assist the supervisor, sometimes using AI assistants. Iterated Amplification is an approach developed by Christiano that iteratively builds a feedback signal for challenging problems by using humans to combine solutions to easier subproblems.[80][99] Iterated Amplification was used to train AI to summarize books without requiring human supervisors to read them.[96][105] Another proposal is to train aligned AI by means of debate between AI systems, with the winner judged by humans.[106][72] Such debate is intended to reveal the weakest points of an answer to a complex question, and reward the AI for truthful and safe answers.

A growing area of research in AI alignment focuses on ensuring that AI is honest and truthful. Researchers from the Future of Humanity Institute point out that the development of language models such as GPT-3, which can generate fluent and grammatically correct text,[108][109] has opened the door to AI systems repeating falsehoods from their training data or even deliberately lying to humans.[110][107]

Current state-of-the-art language models learn by imitating human writing across millions of books worth of text from the Internet.[9][111] While this helps them learn a wide range of skills, the training data also includes common misconceptions, incorrect medical advice, and conspiracy theories. AI systems trained on this data learn to mimic false statements.[107][98][40] Additionally, models often obediently continue falsehoods when prompted, generate empty explanations for their answers, or produce outright fabrications.[33] For example, when prompted to write a biography for a real AI researcher, a chatbot confabulated numerous details about their life, which the researcher identified as false.[112]

To combat the lack of truthfulness exhibited by modern AI systems, researchers have explored several directions. AI research organizations including OpenAI and DeepMind have developed AI systems that can cite their sources and explain their reasoning when answering questions, enabling better transparency and verifiability.[113][114][115] Researchers from OpenAI and Anthropic have proposed using human feedback and curated datasets to fine-tune AI assistants to avoid negligent falsehoods or express when they are uncertain.[22][116][89] Alongside technical solutions, researchers have argued for defining clear truthfulness standards and the creation of institutions, regulatory bodies, or watchdog agencies to evaluate AI systems on these standards before and during deployment.[110]

Researchers distinguish truthfulness, which specifies that AIs only make statements that are objectively true, and honesty, which is the property that AIs only assert what they believe to be true. Recent research finds that state-of-the-art AI systems cannot be said to hold stable beliefs, so it is not yet tractable to study the honesty of AI systems.[117] However, there is substantial concern that future AI systems that do hold beliefs could intentionally lie to humans. In extreme cases, a misaligned AI could deceive its operators into thinking it was safe or persuade them that nothing is amiss.[7][9][5] Some argue that if AIs could be made to assert only what they believe to be true, this would sidestep numerous problems in alignment.[110][118]

Alignment research aims to line up three different descriptions of an AI system:[119]

Outer misalignment is a mismatch between the intended goals (1) and the specified goals (2), whereas inner misalignment is a mismatch between the human-specified goals (2) and the AI's emergent goals (3).

Inner misalignment is often explained by analogy to biological evolution.[120] In the ancestral environment, evolution selected human genes for inclusive genetic fitness, but humans evolved to have other objectives. Fitness corresponds to (2), the specified goal used in the training environment and training data. In evolutionary history, maximizing the fitness specification led to intelligent agents, humans, that do not directly pursue inclusive genetic fitness. Instead, they pursue emergent goals (3) that correlated with genetic fitness in the ancestral environment: nutrition, sex, and so on. However, our environment has changed a distribution shift has occurred. Humans still pursue their emergent goals, but this no longer maximizes genetic fitness. (In machine learning the analogous problem is known as goal misgeneralization.[3]) Our taste for sugary food (an emergent goal) was originally beneficial, but now leads to overeating and health problems. Also, by using contraception, humans directly contradict genetic fitness. By analogy, if genetic fitness were the objective chosen by an AI developer, they would observe the model behaving as intended in the training environment, without noticing that the model is pursuing an unintended emergent goal until the model was deployed.

Research directions to detect and remove misaligned emergent goals include red teaming, verification, anomaly detection, and interpretability.[16][5][17] Progress on these techniques may help reduce two open problems. Firstly, emergent goals only become apparent when the system is deployed outside its training environment, but it can be unsafe to deploy a misaligned system in high-stakes environmentseven for a short time until its misalignment is detected. Such high stakes are common in autonomous driving, health care, and military applications.[121] The stakes become higher yet when AI systems gain more autonomy and capability, becoming capable of sidestepping human interventions (see Power-seeking and instrumental goals). Secondly, a sufficiently capable AI system may take actions that falsely convince the human supervisor that the AI is pursuing the intended objective (see previous discussion on deception at Scalable oversight).

Since the 1950s, AI researchers have sought to build advanced AI systems that can achieve goals by predicting the results of their actions and making long-term plans.[122] However, some researchers argue that suitably advanced planning systems will default to seeking power over their environment, including over humans for example by evading shutdown and acquiring resources. This power-seeking behavior is not explicitly programmed but emerges because power is instrumental for achieving a wide range of goals.[60][4][7] Power-seeking is thus considered a convergent instrumental goal.[55]

Power-seeking is uncommon in current systems, but advanced systems that can foresee the long-term results of their actions may increasingly seek power. This was shown in formal work which found that optimal reinforcement learning agents will seek power by seeking ways to gain more options, a behavior that persists across a wide range of environments and goals.[60]

Power-seeking already emerges in some present systems. Reinforcement learning systems have gained more options by acquiring and protecting resources, sometimes in ways their designers did not intend.[56][123] Other systems have learned, in toy environments, that in order to achieve their goal, they can prevent human interference[57] or disable their off-switch.[59] Russell illustrated this behavior by imagining a robot that is tasked to fetch coffee and evades being turned off since "you can't fetch the coffee if you're dead".[4]

Hypothesized ways to gain options include AI systems trying to:

... break out of a contained environment; hack; get access to financial resources, or additional computing resources; make backup copies of themselves; gain unauthorized capabilities, sources of information, or channels of influence; mislead/lie to humans about their goals; resist or manipulate attempts to monitor/understand their behavior ... impersonate humans; cause humans to do things for them; ... manipulate human discourse and politics; weaken various human institutions and response capacities; take control of physical infrastructure like factories or scientific laboratories; cause certain types of technology and infrastructure to be developed; or directly harm/overpower humans.[7]

Researchers aim to train systems that are 'corrigible': systems that do not seek power and allow themselves to be turned off, modified, etc. An unsolved challenge is reward hacking: when researchers penalize a system for seeking power, the system is incentivized to seek power in difficult-to-detect ways.[5] To detect such covert behavior, researchers aim to create techniques and tools to inspect AI models[5] and interpret the inner workings of black-box models such as neural networks.

Additionally, researchers propose to solve the problem of systems disabling their off-switches by making AI agents uncertain about the objective they are pursuing.[59][4] Agents designed in this way would allow humans to turn them off, since this would indicate that the agent was wrong about the value of whatever action they were taking prior to being shut down. More research is needed to translate this insight into usable systems.[80]

Power-seeking AI is thought to pose unusual risks. Ordinary safety-critical systems like planes and bridges are not adversarial. They lack the ability and incentive to evade safety measures and appear safer than they are. In contrast, power-seeking AI has been compared to a hacker that evades security measures.[7] Further, ordinary technologies can be made safe through trial-and-error, unlike power-seeking AI which has been compared to a virus whose release is irreversible since it continuously evolves and grows in numberspotentially at a faster pace than human society, eventually leading to the disempowerment or extinction of humans.[7] It is therefore often argued that the alignment problem must be solved early, before advanced power-seeking AI is created.[55]

However, some critics have argued that power-seeking is not inevitable, since humans do not always seek power and may only do so for evolutionary reasons. Furthermore, there is debate whether any future AI systems need to pursue goals and make long-term plans at all.[124][7]

Work on scalable oversight largely occurs within formalisms such as POMDPs. Existing formalisms assume that the agent's algorithm is executed outside the environment (i.e. not physically embedded in it). Embedded agency[125][126] is another major strand of research which attempts to solve problems arising from the mismatch between such theoretical frameworks and real agents we might build. For example, even if the scalable oversight problem is solved, an agent which is able to gain access to the computer it is running on may still have an incentive to tamper with its reward function in order to get much more reward than its human supervisors give it.[127] A list of examples of specification gaming from DeepMind researcher Victoria Krakovna includes a genetic algorithm that learned to delete the file containing its target output so that it was rewarded for outputting nothing.[128] This class of problems has been formalised using causal incentive diagrams.[127] Researchers at Oxford and DeepMind have argued that such problematic behavior is highly likely in advanced systems, and that advanced systems would seek power to stay in control of their reward signal indefinitely and certainly.[129] They suggest a range of potential approaches to address this open problem.

Against the above concerns, AI risk skeptics believe that superintelligence poses little to no risk of dangerous misbehavior. Such skeptics often believe that controlling a superintelligent AI will be trivial. Some skeptics,[130] such as Gary Marcus,[131] propose adopting rules similar to the fictional Three Laws of Robotics which directly specify a desired outcome ("direct normativity"). By contrast, most endorsers of the existential risk thesis (as well as many skeptics) consider the Three Laws to be unhelpful, due to those three laws being ambiguous and self-contradictory. (Other "direct normativity" proposals include Kantian ethics, utilitarianism, or a mix of some small list of enumerated desiderata.) Most risk endorsers believe instead that human values (and their quantitative trade-offs) are too complex and poorly-understood to be directly programmed into a superintelligence; instead, a superintelligence would need to be programmed with a process for acquiring and fully understanding human values ("indirect normativity"), such as coherent extrapolated volition.[132]

A number of governmental and treaty organizations have made statements emphasizing the importance of AI alignment.

In September 2021, the Secretary-General of the United Nations issued a declaration which included a call to regulate AI to ensure it is "aligned with shared global values."[133]

That same month, the PRC published ethical guidelines for the use of AI in China. According to the guidelines, researchers must ensure that AI abides by shared human values, is always under human control, and is not endangering public safety.[134]

Also in September 2021, the UK published its 10-year National AI Strategy,[135] which states the British government "takes the long term risk of non-aligned Artificial General Intelligence, and the unforeseeable changes that it would mean for ... the world, seriously".[136] The strategy describes actions to assess long term AI risks, including catastrophic risks.[137]

In March 2021, the US National Security Commission on Artificial Intelligence released stated that "Advances in AI ... could lead to inflection points or leaps in capabilities. Such advances may also introduce new concerns and risks and the need for new policies, recommendations, and technical advances to assure that systems are aligned with goals and values, including safety, robustness and trustworthiness. The US should ... ensure that AI systems and their uses align with our goals and values."[138]

Follow this link:

AI alignment - Wikipedia

Related Posts