Ayan Sivaram (New York University): Publications - PhilPeople

New York University
Department of Philosophy
Stern School of Business

Undergraduate

New York City, New York, United States of America

Inner Misalignment Should Be Understood as a Reference Class Problem

A mesa-optimizer trained on a distribution implicitly learns a “reference class,” a partition of situations it treats as relevantly similar for the purposes of its objective. I argue that inner alignment failure, as characterized by Hubinger et al. [2019], is productively understood as an instance of the reference class problem. The mesa-objective is calibrated to the learned reference class, and when deployment presents situations that fall outside this class, the objective diverges from what training intended in a manner that is structured, predictable, and specific to the features the optimizer learned to treat as load-bearing. This framing connects the technical inner alignment problem to a philosophical literature running from Venn through Hájek, and generates diagnostic predictions absent from the standard framework.

Philosophy of Probability
Philosophy of Artificial Intelligence