Ruchira Dhar (University of Copenhagen): Publications


Mechanistic Interpretability Needs Philosophy
with Iwan Williams, Ninell Oldenburg, Joshua Hatherley, Constanza Fierro, Sandrine R. Schiller, Filippos Stamatiou, and Anders Søgaard

Mechanistic interpretability (MI) aims to explain how neural networks work by uncovering their underlying mechanisms. As the field grows in influence, it is increasingly important to examine not just models themselves, but the assumptions, concepts and explanatory strategies implicit in MI research. We argue that mechanistic interpretability needs philosophy as an ongoing partner in clarifying its concepts, refining its methods, and navigating the epistemic and ethical complexities of interpreting AI systems. There is significant unrealised potential for progress in MI to be gained through deeper engagement with philosophers and philosophical frameworks. Taking three open problems from the MI literature as examples, this paper illustrates the value philosophy can add to MI research, and outlines a path toward deeper interdisciplinary dialogue.

Natural Language Processing; Artificial Intelligence Safety; Interpretability in Artificial Intelligence; Explainability in Artificial Intelligence; Large Language Models; Representation in Connectionism; Artificial Intelligence Methodology

Realist and Pluralist Conceptions of Intelligence and Their Implications on AI Research
with Ninell Oldenburg and Anders Søgaard

Proceedings of the AAAI Conference on Artificial Intelligence 40 (44): 37791-37801. 2026.

In this paper, we argue that current AI (alignment) research operates on a spectrum between two different underlying conceptions of intelligence: Intelligence Realism, which holds that intelligence represents a single, universal capacity measurable across all systems, and Intelligence Pluralism, which views intelligence as diverse, context-dependent capacities that cannot be reduced to a single universal measure. Through an analysis of current debates in AI research, we demonstrate how these conceptions remain largely implicit yet fundamentally shape how empirical evidence is interpreted across a wide range of areas. More significantly, the underlying views generate distinct research strands in three areas. Methodologically, they produce different approaches to model selection, benchmark design, and experimental validation. Interpretively, they lead to contradictory readings of scaling laws and system limitations. Regarding AI risk, they generate categorically different assessments of risk and of alignment approaches: one view treats superintelligence as the biggest risk and searches for unified alignment solutions, while the other sees different threats in many different domains and searches for context-specific solutions. We argue that making these underlying assumptions explicit can contribute to a clearer understanding of the disagreements in this research space and, potentially, a more context-sensitive approach to alignment research.

Epistemology, Miscellaneous; Philosophy, General Works; Cognitive Sciences, Misc

Defining Knowledge: Bridging Epistemology and Large Language Models
with Constanza Fierro, Filippos Stamatiou, Anders Søgaard, and Nicolas Garneau

Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024.

Knowledge claims are abundant in the literature on large language models (LLMs), but can we say that GPT-4 truly "knows" the Earth is round? To address this question, we review standard definitions of knowledge in epistemology and formalize interpretations applicable to LLMs. In doing so, we identify inconsistencies and gaps in how current NLP research conceptualizes knowledge with respect to epistemological frameworks. Additionally, we conduct a survey of 100 professional philosophers and computer scientists to compare their preferences in knowledge definitions and their views on whether LLMs can really be said to know. Finally, we suggest evaluation protocols for testing knowledge in accordance with the most relevant definitions.

Knowledge; Philosophy of Artificial Intelligence