New York City, New York, United States of America
  • A mesa-optimizer trained on a distribution implicitly learns a “reference class," a partition of situations it treats as relevantly similar for the purposes of its objective. I argue that inner alignment failure, as characterized by Hubinger et al. [2019], is productively understood as an instance of the reference class problem. The mesa-objective is calibrated to the learned reference class and when deployment presents situations that fall outside this class, the objective diverges from what tr…Read more