A mesa-optimizer trained on a distribution implicitly learns a “reference class," a partition of situations it treats as relevantly similar for the purposes of its objective. I argue that inner alignment failure, as characterized by Hubinger et al. [2019], is productively understood as an instance of the reference class problem. The mesa-objective is calibrated to the learned reference class and when deployment presents situations that fall outside this class, the objective diverges from what tr…
Read moreA mesa-optimizer trained on a distribution implicitly learns a “reference class," a partition of situations it treats as relevantly similar for the purposes of its objective. I argue that inner alignment failure, as characterized by Hubinger et al. [2019], is productively understood as an instance of the reference class problem. The mesa-objective is calibrated to the learned reference class and when deployment presents situations that fall outside this class, the objective diverges from what training intended in a manner that is structured, predictable, and specific to the features the optimizer learned to treat as load-bearing. This framing connects the technical inner alignment problem to a philosophical literature running from Venn through Hajék, and generates diagnostic predictions absent from the standard framework.