Recent work on ‘emergent misalignment’ has shown that finetuning LLMs on narrow tasks can induce broadly misaligned behavior. This supports the ‘persona selection’ (PSM) hypothesis that, during pre-training, LLMs learn to simulate many different characters and perspectives, which can then be elicited and refined during post-training. Inspired by those results, this paper investigates the converse phenomenon, ‘emergent alignment’, and uses it to support and refine the PSM and motivate a novel desideratum for alignment. We finetune a helpful-only model on broad and narrow safety tasks. To create SFT samples, we follow the ‘Constitutional AI’ (CAI) approach and use four constitutions drawn from ethical systems that could be part of reasonable alignment strategies: deontology, consequentialism, virtue ethics, and aligning AIs as subordinate to and concerned solely with the good of humanity. For each constitution, we show that finetuning on two narrow safety sub-categories (harassment and illegal behaviors) reliably induces emergent alignment. Specifically, the narrowly aligned models perform significantly better than the helpful-only source model on a benchmark covering a representative sample of general safety categories, and on specific safety categories that were carefully filtered out of the data sets used for narrow alignment finetuning. To test the PSM with a more fine-grained evaluation, we also use a multidimensional persona diagnostic that includes dimensions for deontological, consequentialist, virtue-ethical, and “defer-to-authorities” ethical personas. For each constitutionally finetuned (broad and narrow) model, we evaluate how well its behavior matches the signature profile expected given its anchor constitution.
Our results show that our CAI models acquire their expected “ethical persona”: for example, the model narrowly finetuned on SFT samples created using the consequentialist constitution agrees significantly more with utilitarian than with deontological beliefs. At the same time, both our coarse and fine-grained evaluations show that there are significant differences across our (broadly and narrowly finetuned) CAI models in how well they project. Based on those results, we argue that alignment strategies should be evaluated not just on their (in-distribution) general safety performance, but also specifically on their degree of projectability.