Calum McNamara (Indiana University, Bloomington): Publications

Emergent Alignment and the Projectability of Ethical Personas
with Guillermo Del Pinal, Youngchan Lee, and Alejandro Pérez Carballo

Recent work on ‘emergent misalignment’ has shown that fine-tuning LLMs on narrow tasks can induce broadly misaligned behavior. This supports the ‘persona selection’ (PSM) hypothesis that, during pre-training, LLMs learn to simulate many different characters and perspectives, which can then be elicited and refined during post-training. Inspired by those results, this paper investigates the converse phenomenon, ‘emergent alignment’, and uses it to support and refine the PSM and motivate a novel desideratum for alignment. We fine-tune a helpful-only model on broad and narrow safety tasks. To create SFT samples, we follow the ‘Constitutional AI’ (CAI) approach and use four constitutions drawn from ethical systems that could be part of reasonable alignment strategies: deontology, consequentialism, virtue ethics, and aligning AIs as subordinate to and concerned solely with the good of humanity. For each of those models, we show that fine-tuning on two narrow safety sub-categories (harassment and illegal behaviors) reliably induces emergent alignment. Specifically, the narrowly aligned models perform significantly better than the helpful-only source model on a benchmark covering a representative sample of general safety categories, and on specific safety categories that were carefully filtered out of the data sets used for narrow alignment fine-tuning. To test the PSM using a more fine-grained evaluation, we also use a multidimensional persona diagnostic that includes dimensions for deontological, consequentialist, virtue-ethical, and “defer-to-authorities” ethical personas. For each constitutionally fine-tuned (broad and narrow) model, we evaluate how well its behavior matches its expected signature profile (given its anchor constitution).
Our results show that our CAI models acquire their expected “ethical persona”; for example, the model narrowly fine-tuned on SFT samples created using the consequentialist constitution agrees significantly more with utilitarian than deontological beliefs. At the same time, both our coarse and fine-grained evaluations show that there are significant differences across our (broadly and narrowly fine-tuned) CAI models in how well they project. Based on those results, we argue that alignment strategies should be evaluated not just on their (in-distribution) general safety performance, but also specifically on their degree of projectability.

Philosophy of AI, General Works

A Behavioural and Representational Evaluation of Goal-Directedness in Language Model Agents
with Arghal Raghu, Fade Chen, Niall Dalton, Evgenii Kortukov, Angelos Nalmpantis, Moksh Nirvaan, Gabriele Sarti, and Mario Giulianelli

Proceedings of the 43rd International Conference on Machine Learning (ICML). Forthcoming.

Understanding an agent's goals helps explain and predict its behaviour, yet there is no established methodology for reliably attributing goals to agentic systems. We propose a framework for evaluating goal-directedness that integrates behavioural evaluation with interpretability-based analyses of models' internal representations. As a case study, we examine an LLM agent navigating a 2D grid world toward a goal state. Behaviourally, we evaluate the agent against an optimal policy across varying grid sizes, obstacle densities, and goal structures, finding that performance scales with task difficulty while remaining robust to difficulty-preserving transformations and complex goal structures. We then use probing methods to decode the agent's internal representations of the environment state and its multi-step action plans. We find that the LLM agent non-linearly encodes a coarse spatial map of the environment, preserving approximate task-relevant cues about its position and the goal location; that its actions are broadly consistent with these internal representations; and that reasoning reorganises them, shifting from broader environment structural cues toward information supporting immediate action selection. Our findings support the view that introspective examination is required beyond behavioural evaluations to characterise how agents represent and pursue their objectives.

Choice and Credence in Context
Dissertation, University of Michigan, Ann Arbor. 2024.

This dissertation is about the role that conditionals play in uncertain reasoning and deliberation. Specifically, I attempt to show that, by appealing to a particular semantics for conditionals---a contextualist, sequence semantics, which has recently become popular in philosophy of language---several open problems in decision theory and epistemology can be solved. Chapter 1 is introductory. I set out the semantic view of conditionals in question, and I describe some of its historical background. Chapter 2 turns to a striking problem faced by causal decision theorists. A popular formulation of causal decision theory (CDT) appeals to counterfactual conditionals. However, the standard theory of these conditionals has unintuitive consequences in deterministic worlds. In particular, it says that if anything---including the choice you make---were different in the present, then either the laws of nature would be violated, or the distant past would be changed. And as several authors have recently shown, it's easy to transform this consequence of the standard theory of counterfactuals into full-blown counterexamples to CDT. In response to these counterexamples, I develop a contextualist version of CDT, which makes use of the sequence semantics. I then show that the deterministic counterexamples don't arise for my version of CDT. In Chapter 3, I deal with a different puzzle, about whether or not the so-called Desire-as-Belief (DAB) thesis is consistent with decision theory---something that famous arguments of David Lewis seem to show isn't the case. Once again, I show that, if we understand the DAB thesis in a contextualist way---and spell it out using the sequence semantics---then Lewis's arguments against that thesis don't go through. In fact, we can prove a tenability result for the DAB thesis, which shows that it's compatible with decision theory after all. Finally, in Chapter 4, I transition from decision-theoretic issues to epistemological ones.
More precisely, I tackle the question of how our credences should change when we learn indicative conditionals. Several famous cases in the literature---notably, Bas van Fraassen's Judy Benjamin problem---seem to show that the standard Bayesian update rules deliver implausible results when we learn conditionals of this kind. However, in the chapter, I show that, if we adopt the sequence semantics, then the Bayesian update rules turn out to deliver the correct results after all. Better still, alternatives to these rules which have been put forward in the literature turn out to be equivalent to the Bayesian rules in my framework---at least in many contexts. Thus, what we end up with is a nice, unified account of how rational agents should update on conditional information: one which fits in well with recent work on the semantics of conditionals. My proposal also relates, in interesting ways, to discussions that have been happening elsewhere in the literature, like discussions about the tenability of the notorious Stalnaker's thesis.

Causal decision theory, context, and determinism
Philosophy and Phenomenological Research 109 (1): 226-260. 2024.

The classic formulation of causal decision theory (CDT) appeals to counterfactuals. It says that you should aim to choose an option that would have a good outcome, were you to choose it. However, this version of CDT faces trouble if the laws of nature are deterministic. After all, the standard theory of counterfactuals says that, if the laws are deterministic, then if anything (including the choice you make) were different in the present, either the laws would be violated or the distant past would be changed. And as several authors have shown, it's easy to transform this upshot of the standard theory of counterfactuals into full-blown counterexamples to CDT. In response to these counterexamples, I argue here that the problem lies not so much with CDT's guiding idea (that it's the expected causal consequences of your actions that matter for rational decision-making) but with the fact that the classic formulation of CDT doesn't pay sufficient attention to the context-sensitivity of counterfactuals. I develop a contextualist version of CDT which better accounts for this context-sensitivity. And I show that my theory avoids the problems faced by the classic formulation of CDT in deterministic worlds.

Causal Decision Theory; Causation and Laws of Nature

The punctuated equilibrium of scientific change: a Bayesian network model
with Patrick Grim, Frank Seidl, Isabell N. Astor, and Caroline Diaso

Synthese 200 (4): 1-25. 2022.

Our scientific theories, like our cognitive structures in general, consist of propositions linked by evidential, explanatory, probabilistic, and logical connections. Those theoretical webs ‘impinge on the world at their edges,’ subject to a continuing barrage of incoming evidence. Our credences in the various elements of those structures change in response to that continuing barrage of evidence, as do the perceived connections between them. Here we model scientific theories as Bayesian nets, with credences at nodes and conditional links between them modelled as conditional probabilities. We update those networks, in terms of both credences at nodes and conditional probabilities at links, through a temporal barrage of random incoming evidence. Robust patterns of punctuated equilibrium, suggestive of ‘normal science’ alternating with ‘paradigm shifts,’ emerge prominently in that change dynamics. The suggestion is that at least some of the phenomena at the core of the Kuhnian tradition are predictable in the typical dynamics of scientific theory change captured as Bayesian nets under even a random evidence barrage.

Punctuated Equilibrium; Theory Change; Updating Principles

Scientific Theories as Bayesian Nets: Structure and Evidence Sensitivity
with Patrick Grim, Frank Seidl, Hinton E. Rago, Isabell N. Astor, Caroline Diaso, and Peter Ryner

Philosophy of Science 89 (1): 42-69. 2022.

We model scientific theories as Bayesian networks. Nodes carry credences and function as abstract representations of propositions within the structure. Directed links carry conditional probabilities and represent connections between those propositions. Updating is Bayesian across the network as a whole. The impact of evidence at one point within a scientific theory can have a very different impact on the network than does evidence of the same strength at a different point. A Bayesian model allows us to envisage and analyze the differential impact of evidence and credence change at different points within a single network and across different theoretical structures.

The Nature of Theories; Bayesian Reasoning, Misc; Conditionalization; Updating Principles
