Julia Haas

DeepMind
  • The Puzzle of Evaluating Moral Cognition in Artificial Agents
    with Madeline G. Reinecke, Yiran Mao, Markus Kunesch, Edgar A. Duéñez-Guzmán, and Joel Z. Leibo
    Cognitive Science 47 (8). 2023.
    In developing artificial intelligence (AI), researchers often benchmark against human performance as a measure of progress. Is this kind of comparison possible for moral cognition? Given that human moral judgment often hinges on intangible properties like “intention” which may have no natural analog in artificial agents, it may prove difficult to design a “like‐for‐like” comparison between the moral behavior of artificial and human agents. What would a measure of moral behavior for both humans a…
  • Recovering Spinoza's theory of akrasia
    In Ursula Goldenbaum & Christopher Kluz (eds.), Doing Without Free Will: Spinoza and Contemporary Moral Problems, Lexington Books. 2015.
  • In this opinionated review, I draw attention to some of the contributions reinforcement learning can make to questions in the philosophy of mind. In particular, I highlight reinforcement learning's foundational emphasis on the role of reward in agent learning, and canvass two ways in which the framework may advance our understanding of perception and motivation.
  • The evaluative mind
    In Mind Design III. forthcoming.
    I propose that the successes and contributions of reinforcement learning urge us to see the mind in a new light, namely, to recognise that the mind is fundamentally evaluative in nature.
  • I argue for the role of reinforcement learning in the philosophy of mind. To start, I make several assumptions about the nature of reinforcement learning and its instantiation in minds like ours. I then review some of the contributions reinforcement learning methods have made across the so-called 'decision sciences.' Finally, I show how principles from reinforcement learning can shape philosophical debates regarding the nature of perception and characterisations of desire.
  • Is Synchronic Self-Control Possible?
    Review of Philosophy and Psychology 12 (2): 397-424. 2020.
    An agent exercises instrumental rationality to the degree that she adopts appropriate means to achieving her ends. Adopting appropriate means to achieving one’s ends can, in turn, involve overcoming one’s strongest desires, that is, it can involve exercising synchronic self-control. However, contra prominent approaches, I deny that synchronic self-control is possible. Specifically, I draw on computational models and empirical evidence from cognitive neuroscience to describe a naturalistic, multi…
  • Can hierarchical predictive coding explain binocular rivalry?
    Philosophical Psychology 34 (3): 424-444. 2021.
    Hohwy et al.’s (2008) model of binocular rivalry (BR) is taken as a classic illustration of predictive coding’s explanatory power. I revisit the account and show that it cannot explain the role of reward in BR. I then consider a more recent version of Bayesian model averaging, which recasts the role of reward in BR in terms of optimism bias. If we accept this account, however, then we must reconsider our conception of perception. On this latter view, I argue, organisms engage in what amounts…
  • The Neuroscience of Moral Judgment: Empirical and Philosophical Developments
    In Felipe De Brigard & Walter Sinnott-Armstrong (eds.), Neuroscience and Philosophy, MIT Press. pp. 17-47. 2022.
    We chart how neuroscience and philosophy have together advanced our understanding of moral judgment with implications for when it goes well or poorly. The field initially focused on brain areas associated with reason versus emotion in the moral evaluations of sacrificial dilemmas. But new threads of research have studied a wider range of moral evaluations and how they relate to models of brain development and learning. By weaving these threads together, we are developing a better understanding o…
  • I describe a suite of reinforcement learning environments in which artificial agents learn to value and respond to moral content and contexts. I illustrate the core principles of the framework by characterizing one such environment, or “gridworld,” in which an agent learns to trade off between monetary profit and fair dealing, as applied in a standard behavioral economic paradigm. I then highlight the core technical and philosophical advantages of the learning approach for modeling moral cogniti…
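    The profit-versus-fairness trade-off this abstract describes can be sketched as a toy one-step learning problem. Everything below is an illustrative assumption (the pie size, the fairness weight, the epsilon-greedy learner), not the paper's actual gridworld or parameters:

    ```python
    import random

    # Hypothetical sketch, not the paper's environment: a one-step "fair
    # dealing" bandit in the spirit of an ultimatum-game proposer. The agent
    # picks how much of a 10-unit pie to offer a partner; reward mixes the
    # money it keeps with a penalty for unequal splits.

    random.seed(0)

    PIE = 10
    FAIRNESS_WEIGHT = 0.6  # assumed trade-off parameter, chosen for illustration

    def reward(offer: int) -> float:
        kept = PIE - offer
        inequity = abs(kept - offer)  # Fehr-Schmidt-style inequity term
        return kept - FAIRNESS_WEIGHT * inequity

    def train(episodes: int = 20_000, eps: float = 0.1, lr: float = 0.1) -> dict:
        """Epsilon-greedy value learning over the 11 possible offers."""
        q = {offer: 0.0 for offer in range(PIE + 1)}
        for _ in range(episodes):
            if random.random() < eps:
                offer = random.choice(list(q))  # explore
            else:
                offer = max(q, key=q.get)       # exploit current estimates
            q[offer] += lr * (reward(offer) - q[offer])
        return q

    q_values = train()
    best_offer = max(q_values, key=q_values.get)
    ```

    With this fairness weight the learned optimum is the equal split (an offer of 5); lowering FAIRNESS_WEIGHT toward 0 shifts the agent back toward keeping the whole pie, which is the kind of profit-versus-fairness behavior the environment is meant to elicit.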
  • This paper presents an empirical solution to the puzzle of weakness of will. Specifically, it presents a theory of action, grounded in contemporary cognitive neuroscientific accounts of decision making, that explains the phenomenon of weakness of will without resulting in a puzzle.