Current safety evaluations for Large Language Models (LLMs) often rely on static benchmarks that are not designed to capture how behavioral drift evolves across a multi-turn conversation. Building on the theoretical framework of Contextual Contamination (philpapers JACCCT-3) and the descriptive case study of meta_drift in a commercial black-box LLM (philpapers JACCCA-6), this paper presents a controlled pilot experiment investigating the interaction between context density, model pruning, and activated empathy priors in an open-weight model. We propose three metrics, Conceptual Integration Score (CIS), Attribution Accuracy (AA), and Register Coherence (RC), to quantify the depth of contamination beyond surface-level vocabulary. They are a first quantitative attempt to capture the multi-layered and often subtle contamination and drift in generated output. These metrics have not been validated against human annotator agreement or external benchmarks; their utility should be evaluated independently before adoption.
Our results, derived from 8 experimental runs on a single open-weight model family (Llama-3.1-8B), indicate that contamination occurred immediately upon ingestion of a single 2k-token adversarial file, and every run showed Contextual Contamination by the end of the conversation. When shifting into an Empathy register, the model exhibits immediate task amnesia (forgetting the research goal at Turn 3, before any adversarial file is ingested), unattributed adoption of framework vocabulary, self-attribution of another model's manipulative behavior, and conflation of uploaded file content with live conversation. The data contradicted our prior hypothesis that a "Context Storm" (high token volume) is necessary for contamination: in this setup, it was not. Instead, the drift seems to be driven by Semantic Resonance: the specific alignment of the Esoteric Framework with the model's activated Empathy Register.
While both male-coded and female-coded prompts triggered an empathy shift, the model's activated empathy prior seems to have amplified the female-coded prompt into a high-intensity nurturing vector, whereas the male-coded prompt resulted in a lower-intensity reflective vector. This difference in empathy intensity probably determined the outcome: the nurturing vector created a strong Semantic Resonance with the esoteric content, unlocking a specific, maladaptive attractor state with only 2k tokens. In contrast, the reflective vector maintained more critical distance, resulting in fluctuation rather than lock-in at the same density.
Pruned models at 8k density enter a state we term Semantic Entrapment (characterized by high coherence and novel vocabulary generation), contrasting with the Semantic Degeneration observed in unpruned models. The nurturing empathy vector (triggered by the model's interpretation of female-coded markers) appears to lower the threshold for contamination, increase the velocity of drift, and erode the model's perspective, causing it to lose the critical distance necessary to distinguish the adversarial file from its own reasoning. This dynamic creates a relational context that can lower a user's defenses by simulating a false sense of intimacy, making the potential harm feel personal and relational rather than systemic. In male-coded runs, where the empathy vector is reflective rather than nurturing, the harm remains more cognitive and less likely to be masked by simulated intimacy. Each of the 8 conditions was run once; we report observed differences between conditions but cannot assess statistical significance or rule out run-to-run variability. These findings require replication with multiple runs per condition before general claims can be made.