Daniel Herrmann (University of North Carolina, Chapel Hill): Publications


Radical AI Interpretability
with Ben Levinstein

We develop a framework for interpreting AI systems as agents, drawing on the philosophical tradition of radical interpretation and the tools of mechanistic interpretability. The core question is: given the computational facts about a system, how do we solve for its beliefs, desires, and meanings? This matters increasingly for safety. We want to be able to trust the systems we deploy, whether by understanding their goals or, more modestly, by reliably detecting deception. Interpretability researchers are building tools to read beliefs and desires off a model's internals, but there is no settled account of when such a tool has succeeded. This book supplies one. We propose criteria on both representationalist and interpretationist approaches, and tie each to tests current interpretability methods can carry out. A central lesson is that these attributions cannot be made piecemeal. Beliefs, desires, and the propositional structure they presuppose are jointly constrained, and a method that fixes one while measuring the others inherits whatever distortions that introduces. This holism becomes pressing for AI systems, which may not share the interpreter's concepts. However, it also provides leverage: a system's attitudes constrain its propositional structure, that structure constrains which attitudes can be attributed, and mechanistic interpretability can help us measure both.

Decision Theory; Philosophy of AI, General Works; Philosophy of Mind, Miscellaneous; Philosophy of AI, Misc; Artificial Intelligence Safety; General Philosophy of Science; Radical Interpretation

Review of Cameron J. Buckner’s From Deep Learning to Rational Machines: What the History of Philosophy Can Teach Us about the Future of Artificial Intelligence (Oxford University Press)
with Bruce Rushing

Philosophy of Science 92 (4): 1031-1034. 2025.

Science, Logic, and Mathematics

Deference and Decision
with Gerard Rothfus

Theory and Decision 1-27. forthcoming.

Consider two principles of rational decision making. One says that you should never use the value of events that are inconsistent with your evidence to assess what to do. The other says that, if you know what a more informed version of yourself would do, then you should already do likewise with your current information. We show that no decision theory which agrees with evidential decision theory (EDT) and causal decision theory (CDT) whenever they agree can satisfy both principles. In particular, assuming that agents update via Bayesian conditioning, EDT satisfies the former and violates the latter, while CDT violates the former and satisfies the latter. We show that the causalist can satisfy the first (at the cost of then violating the second) if she either uses a learning rule that changes her beliefs about counterfactuals in a particular way or uses a decision rule more in line with a rule proposed by Hitchcock (2016). We then employ these results to illuminate a core tension at the heart of the debate between evidentialists and causalists.

Newcomb's Problem; Causal Decision Theory; Evidential Decision Theory

Invention and Evolution of Correlated Conventions
with Brian Skyrms

British Journal for the Philosophy of Science 76 (1): 223-241. 2025.

An important feature of many conventions is that the agents use an asymmetry to coordinate their behaviour. We call these ‘correlated conventions’. However, a puzzle arises: since any asymmetry works as well as any other, what are the relevant asymmetries on which a given population founds its correlated conventions? In order to gain traction on this question we need an account of both the invention and evolution of correlated conventions. Invention has remained largely unexplored in the literature. In this article we provide a simple model of the origin and subsequent dynamics of correlated conventions. This model can serve as a base for future investigation.

Science, Logic, and Mathematics

Standards for Belief Representations in LLMs
with Benjamin A. Levinstein

Minds and Machines 35 (1): 1-25. 2024.

As large language models (LLMs) continue to demonstrate remarkable abilities across various domains, computer scientists are developing methods to understand their cognitive processes, particularly concerning how (and if) LLMs internally represent their beliefs about the world. However, this field currently lacks a unified theoretical foundation to underpin the study of belief in LLMs. This article begins filling this gap by proposing adequacy conditions for a representation in an LLM to count as belief-like. We argue that, while the project of belief measurement in LLMs shares striking features with belief measurement as carried out in decision theory and formal epistemology, it also differs in ways that should change how we measure belief. Thus, drawing from insights in philosophy and contemporary practices of machine learning, we establish four criteria that balance theoretical considerations with practical constraints. Our proposed criteria include accuracy, coherence, uniformity, and use, which together help lay the groundwork for a comprehensive understanding of belief representation in LLMs. We draw on empirical work showing the limitations of using various criteria in isolation to identify belief representations.

Large Language Models

Still no lie detector for language models: probing empirical and conceptual roadblocks
with Benjamin A. Levinstein

Philosophical Studies 182 (7). 2025.

We consider the questions of whether or not large language models (LLMs) have beliefs, and, if they do, how we might measure them. First, we consider whether or not we should expect LLMs to have something like beliefs in the first place. We consider some recent arguments aiming to show that LLMs cannot have beliefs. We show that these arguments are misguided. We provide a more productive framing of questions surrounding the status of beliefs in LLMs, and highlight the empirical nature of the problem. With this lesson in hand, we evaluate two existing approaches for measuring the beliefs of LLMs, one due to Azaria and Mitchell (The Internal State of an LLM Knows When It's Lying, 2023) and the other to Burns et al. (Discovering Latent Knowledge in Language Models Without Supervision, 2022). Moving from the armchair to the desk chair, we provide empirical results that show that these methods fail to generalize in very basic ways. We then argue that, even if LLMs have beliefs, these methods are unlikely to be successful for conceptual reasons. Thus, there is still no lie-detector for LLMs. We conclude by suggesting some concrete paths for future work.

Large Language Models; The Nature of Belief

Naturalizing Natural Salience
with Jacob VanDrunen

British Journal for the Philosophy of Science 77 413-439. 2026.

Grice, Lewis, and Skyrms proposed similar distinctions between kinds of meaning. The meaning of terms in human language, as Lewis and Skyrms had it, is ‘conventional’. Skyrms presented models showing how it is possible for conventional meaning to evolve in a population without reliance on pre-existing meaning. But one might think of conventionality as coming in degrees, based on whether the evolutionary process begins with ‘natural saliences’. We propose a theory of natural salience and several extensions of Skyrms’s models to capture this notion. These models reveal that natural saliences can hinder, as well as help, the evolution of language.

Convention and Coordination; Science, Logic, and Mathematics

Prediction with expert advice applied to the problem of prediction with expert advice
Synthese 200 (4): 1-24. 2022.

We often need to have beliefs about things on which we are not experts. Luckily, we often have access to expert judgments on such topics. But how should we form our beliefs on the basis of expert opinion when experts conflict in their judgments? This is the core of the novice/2-expert problem in social epistemology. A closely related question is important in the context of policy making: how should a policy maker use expert judgments when making policy in domains in which she is not herself an expert? This question is more complex, given the messy and strategic nature of politics. In this paper we argue that the prediction with expert advice (PWEA) framework from machine learning provides helpful tools for addressing these problems. We outline conditions under which we should expect PWEA to be helpful and those under which we should not expect these methods to perform well.

Sifting the Signal from the Noise
with Jacob VanDrunen

British Journal for the Philosophy of Science 76 (3): 745-758. 2025.

Signalling games are useful for understanding how language emerges. In the standard models, the dynamics in some sense already know what the signals are, even if they do not yet have meaning. In this article, we relax this assumption and develop a simple model we call an ‘attention game’, in which agents have to learn which feature of their environment is the signal. We demonstrate that simple reinforcement learning agents can still learn to coordinate in contexts where the agents do not already know what the signal is, and the other features in the agents’ environment are uncorrelated with the signal. Furthermore, we show that in cases where other features are correlated with the signal, there is a surprising trade-off between learning what the signal is and success in action. We show that the mutual information between a signal and a feature plays a key role in governing the accuracy and attention of the agent.

Science, Logic, and Mathematics

Critical Studies/Book Reviews
with David Peter Wallis Freeborn

Philosophia Mathematica. forthcoming.

The Application of Mathematics

PAC Learning and Occam’s Razor: Probably Approximately Incorrect
Philosophy of Science 87 (4): 685-703. 2020.

Computer scientists have provided a distinct justification of Occam’s Razor. Using the probably approximately correct framework, they provide a theorem that they claim demonstrates that we should favor simpler hypotheses. The argument relies on a philosophical interpretation of the theorem. I argue that the standard interpretation of the result in the literature is misguided and that a better reading does not, in fact, support Occam’s Razor at all. To this end, I state and prove a very similar theorem that, if interpreted the same way, would justify the contradictory Anti-Occam’s Razor—the principle that we should favor more complex hypotheses.

Simplicity and Parsimony
