Write-up of Ethics in AI Colloquium – Two Mistakes in AI Design?

Photo collage of speakers and attendees at the Ethics in AI Colloquium “Two Mistakes in AI Design?”
Photo credit: Oxford Atelier

Write-up by Michael Cheng, Doctoral Candidate in Computer Science at the Institute for Ethics in AI. 

On 13 February 2025, the Institute welcomed Professor Ruth Chang, Oxford’s Chair of Jurisprudence, for the Ethics in AI Colloquium “Two Mistakes in AI Design?”. During the colloquium, Professor Chang described what she believes are “two fundamental mistakes about human values embedded in AI design.”

Professor Chang began by outlining four major clusters of open problems in AI design: learning, reasoning, safety and control, and alignment. She focused on alignment: the problem of ensuring that machines align with human values. Chang argued that alignment is the most important of the four because it is either a precondition for or a means to solving the other three: a machine doesn’t count as learning or reasoning well if it learns to do things we judge to be bad or reasons to evil conclusions, and aligning AI outputs with human values and judgments is what provides safety and control.

Chang then offered a philosophical inquiry into the goals of alignment, describing two questionable assumptions about the alignment problem. First, setting a goal of always aligning a machine’s purpose with a human’s purpose is well-intentioned, but we know that some purposes can’t or shouldn’t be aimed at. Consider the paradox of happiness: even if our purpose is to be happy, aiming at happiness doesn’t always achieve it. Second, AI alignment is not just about making machines match our preferences. Preferences are one thing; values are another.

Professor Chang continued by describing two current approaches to alignment that have fallen short: regulation and “development interventions”. Although regulation is well-meaning, Professor Chang argued that it cannot achieve alignment on its own: regulation takes place in a rapidly shifting landscape of practices, norms, institutions, and technological change, and while it focuses on risk, it is difficult to anticipate all of the risks of AI. Many technologists instead attempt to achieve alignment through development interventions, such as reinforcement learning from human feedback. But Professor Chang argued that development interventions do not currently work especially well, and that it is not inevitable that more data and compute will achieve alignment. According to Professor Chang, regulation and development interventions will fail to achieve alignment because they are “ad hoc guardrails” that cannot prevent misaligned AI from being created in the first place.

Professor Chang argued that AI has been designed on the basis of two fundamentally mistaken assumptions about human values, and that these mistakes must be addressed if alignment is to be achieved. She hoped that future attempts at AI alignment would take these “mistakes” into account.

The first mistake is found in the covering problem: how do we get AI to cover all and only the purposes for which we design it? Stuart Russell has declared that in AI design you get what you ask for, not necessarily what you want: it is very difficult to build into a machine purposes that cover all and only the purposes for which you design it. To get around this, many technologists rely on a critical assumption Chang called the “values proxy” assumption: that you can always achieve an evaluative end V by finding the best proxy for it, that is, by pursuing a non-evaluative proxy P across a wide range of circumstances. For instance, a hiring algorithm might look for the best candidate for a job by finding a close match with the CVs of the people already working there. But that non-evaluative proxy for job performance might not track it, downgrading some applicants simply because they do not demographically resemble those who were already hired. Professor Chang argued that there are axiological reasons why non-evaluative proxies cannot approximate values across different circumstances.
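To make the worry concrete, here is a minimal, hypothetical sketch (not drawn from the talk, with invented names and data) of a proxy-based hiring scorer: the evaluative goal of finding the best candidate is replaced by the non-evaluative proxy of similarity to the CVs of current employees.

```python
# Hypothetical illustration only: "similarity to current employees' CVs" stands in
# as a non-evaluative proxy for the evaluative goal "best candidate for the job".
# All names and keyword sets are invented; nothing here comes from a real system.

def cv_similarity(candidate_terms, employee_terms):
    """Jaccard overlap between two sets of CV keywords: a purely non-evaluative measure."""
    a, b = set(candidate_terms), set(employee_terms)
    return len(a & b) / len(a | b) if a | b else 0.0

def proxy_score(candidate, current_employees):
    """Score a candidate by average CV similarity to the existing workforce."""
    sims = [cv_similarity(candidate, emp) for emp in current_employees]
    return sum(sims) / len(sims)

# The current workforce is homogeneous.
current_employees = [
    {"oxford", "rowing", "consulting", "python"},
    {"oxford", "rugby", "finance", "python"},
]

# Candidate A mirrors the existing hires; candidate B has a different background
# but, by hypothesis, equal or better job-relevant ability.
candidate_a = {"oxford", "rowing", "finance", "python"}
candidate_b = {"open_university", "care_work", "statistics", "python"}

for name, cv in [("A", candidate_a), ("B", candidate_b)]:
    print(name, round(proxy_score(cv, current_employees), 2))
# A scores 0.6, B scores 0.14: B is downgraded simply for not resembling past
# hires, because the proxy tracks resemblance, not who would do the job best.
```

The sketch is only meant to show the structure of the failure Chang describes: the proxy can be optimised perfectly while the evaluative aim it was meant to cover is missed.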

The first mistake of AI design, then, is to assume that we can use non-evaluative proxies for our evaluative goals and purposes across a wide range of circumstances. Such proxies will necessarily fail to cover the evaluative aims and purposes for which we design AI in the first place, so current AI design guarantees long-term value misalignment even where it achieves impressive short-term results. Professor Chang argued that technologists should instead develop “value-based” AI that (mostly) eschews non-evaluative proxies in favour of processing evaluative facts themselves. She also argued that value-based AI would have to be “small AI” if it is to achieve alignment: we should be developing specialised value-based AI, divided up not according to the specific practical tasks we want accomplished, but according to divisions between distinct normative problems.

The second mistake is found in the trade-off problem. We typically have multiple purposes for which we design our machines, and there are trade-offs between satisfying them: we want to find the best hire, but that involves trading off evaluative qualities like productivity, reliability, and moral goodness. Professor Chang observed that many technologists attempt to deal with the trade-off problem in one of two ways: by finding proxies for the values we care about (e.g. using “similar in CV to the people we’ve already hired who are doing well” as a proxy for the trade-off between productivity and reliability), or by recognising that multiple criteria determine the outputs, assigning weights to those criteria, sampling and evaluating the outputs, and adjusting the weights until an acceptable result is achieved.
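A minimal sketch of the second, weight-tuning approach may help; the criteria, weights, candidates, and adjustment step below are all invented for illustration and are not taken from the talk.

```python
# Hypothetical illustration of multi-criteria scoring with hand-tuned weights.

candidates = {
    "Ana":   {"productivity": 0.9, "reliability": 0.5, "moral_goodness": 0.7},
    "Bilal": {"productivity": 0.6, "reliability": 0.9, "moral_goodness": 0.8},
}

def weighted_score(scores, weights):
    """Collapse several evaluative criteria into one number via a weighted sum."""
    return sum(weights[c] * scores[c] for c in weights)

def rank(candidates, weights):
    """Order candidates by their weighted score, best first."""
    return sorted(candidates, key=lambda name: weighted_score(candidates[name], weights),
                  reverse=True)

weights = {"productivity": 0.5, "reliability": 0.3, "moral_goodness": 0.2}
print(rank(candidates, weights))   # ['Ana', 'Bilal'] under the initial weights

# "Adjusting the weights until achieving an acceptable result": the designer,
# unhappy with the sample output, bumps reliability's weight and re-runs the ranking.
weights = {"productivity": 0.3, "reliability": 0.5, "moral_goodness": 0.2}
print(rank(candidates, weights))   # ['Bilal', 'Ana']: the ordering flips with the weights
```

The illustration is only meant to show that, on this approach, the “acceptable result” is reached by fiddling with the weights rather than by settling how productivity, reliability, and moral goodness actually trade off against one another.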

Professor Chang argued that current approaches to the trade-off problem make a false assumption about the structure of value: they do not recognise “hard choices”. It is widely assumed (especially in quantitative fields) that when comparing two things, either one is better than the other or the two are equal. But in reality humans face many hard choices, such as whether to eat cereal or a doughnut, whether or not to have kids, or whether to become a lawyer or a philosopher. According to Chang, there is no mathematical formula that decides which option is best in a hard choice. Instead, there is a fourth value relation between the options: they are “on a par”. The options can be compared, each can be better than the other in some respects, and they can be in the same neighbourhood of whatever matters in the choice between them, but the usual trichotomy of relations (better, worse, and equal) does not hold between them.
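Purely as a structural illustration, and not as a rendering of Chang’s account, one can contrast a comparator confined to the usual trichotomy with one whose output type admits a fourth verdict; the options, scores, and threshold below are invented and deliberately crude.

```python
# Hypothetical illustration: the point is only that "on a par" is a fourth possible
# output, distinct from "equal", not a model of what parity actually consists in.

from enum import Enum

class Relation(Enum):
    BETTER = "better"
    WORSE = "worse"
    EQUAL = "equal"
    ON_A_PAR = "on a par"

def trichotomous_compare(a, b):
    """What a single-score system assumes: one overall number settles every comparison."""
    if a["overall"] > b["overall"]:
        return Relation.BETTER
    if a["overall"] < b["overall"]:
        return Relation.WORSE
    return Relation.EQUAL

def parity_aware_compare(a, b, tolerance=0.15):
    """Allow a fourth verdict: options in the same evaluative neighbourhood that are
    better than each other in different respects count as on a par, not equal."""
    close = abs(a["overall"] - b["overall"]) <= tolerance
    if close and a["respects"] != b["respects"]:
        return Relation.ON_A_PAR
    return trichotomous_compare(a, b)

lawyer = {"overall": 0.80, "respects": {"income", "security"}}
philosopher = {"overall": 0.75, "respects": {"meaning", "autonomy"}}

print(trichotomous_compare(lawyer, philosopher))   # Relation.BETTER: the trichotomy forces a verdict
print(parity_aware_compare(lawyer, philosopher))   # Relation.ON_A_PAR: neither better, worse, nor equal
```

The sketch captures only the structural claim that parity is a distinct output of comparison; it does not capture what, on Chang’s view, makes two options on a par.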

If you ask a machine to declare whether A or B is better, it either chooses A, chooses B, or declares that A and B are equal. But humans resolve hard choices differently: in hard choices, humans commit or drift. Humans commit by putting their very selves behind a feature of an option, thereby making that option more valuable than it was before. By committing to Harry, you make him the best person with whom to spend your life. Humans can also drift in a hard choice between options: you can drift into a relationship with Harry, intentionally choosing to be with him, without committing to making a life with him. Many people drift into love relationships without commitment, and for them Harry is not better than, but on a par with, Tom or Dick. According to Professor Chang, what we do in hard choices crucially determines what values we have going forward. This ability to make it true that we have most reason to do this rather than that, through our commitments in hard choices, is a central part of human agency and something that machines can never replicate. A machine can predict how we would commit in a hard choice, but unless the human actually commits, machine and human values will become misaligned.

According to Chang, humans face hard choices all the time, and machines that align with human values should face them, too. Since alignment with human values requires human input in hard choices, recognising hard choices puts the human in the loop of machine processing in a distinctive way.