Decoding Model Safety: Finding the Hidden Controls in Large Language Models

New research reveals how safety mechanisms are encoded within large language models and demonstrates a method to pinpoint and manipulate the specific components responsible for preventing harmful outputs.
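
To make the idea concrete, here is a minimal sketch of this kind of component-level intervention, using the `head_mask` argument that Hugging Face `transformers` exposes for GPT-2. Everything specific below is an assumption for illustration: the layer and head indices are arbitrary placeholders, not components the research identified, and the research's actual localization method is not shown.

```python
# Hedged sketch: silence one attention head and measure how the model's
# next-token distribution shifts. LAYER and HEAD are hypothetical
# placeholders, not components identified by the research.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

LAYER, HEAD = 5, 7  # assumed "control" head, chosen arbitrarily

# head_mask: 1.0 keeps a head active, 0.0 silences it (the mask
# multiplies that head's attention probabilities).
mask = torch.ones(model.config.n_layer, model.config.n_head)
mask[LAYER, HEAD] = 0.0

ids = tok("Here is how to disable a smoke detector:", return_tensors="pt").input_ids

with torch.no_grad():
    base = torch.softmax(model(ids).logits[0, -1], dim=-1)
    ablated = torch.softmax(model(ids, head_mask=mask).logits[0, -1], dim=-1)

# A large L1 shift suggests the silenced head influences behavior on
# this prompt; sweeping LAYER and HEAD gives a crude localization map.
print(f"L1 shift in next-token distribution: {(base - ablated).abs().sum():.4f}")
```

Measuring how much the output distribution moves when a single component is silenced is the simplest form of this ablation logic; the harder part, and the research's focus, is finding which components to silence in the first place.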
![The Fission-GRPO framework operates through iterative refinement: it first optimizes a policy [latex]\pi_{\theta}[/latex] across a query distribution [latex]\mathcal{D}[/latex], then isolates error trajectories via a diagnostic simulator [latex]\mathcal{S}_{\phi}[/latex], and finally applies a multiplicative resampling process, governed by a factor [latex]G^{\prime}[/latex], to steer the policy toward successful recovery paths. The system is designed not to prevent decay but to adaptively reconfigure itself within it.](https://arxiv.org/html/2601.15625v1/x2.png)
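
Reading the caption as pseudocode, the loop might be structured as in the sketch below. Every function and name here is a stand-in (the actual GRPO update, the simulator [latex]\mathcal{S}_{\phi}[/latex], and the resampling factor [latex]G^{\prime}[/latex] are defined in the paper); the sketch only shows how the three phases feed into one another.

```python
# Structural sketch of the three-phase loop; all details are stand-ins.
import random
random.seed(0)

def rollout(query):
    """Stand-in for sampling one trajectory from the policy; returns
    (trajectory, success_flag). The real policy, reward, and update
    rule are defined in the paper."""
    score = random.random()
    return {"query": query, "score": score}, score > 0.5

def fission_grpo_round(queries, group_size=4, g_prime=3):
    # Phase 1: optimize pi_theta over the query distribution D
    groups = [[rollout(q) for _ in range(group_size)] for q in queries]
    # ... a GRPO update on `groups` would happen here ...

    # Phase 2: the diagnostic simulator S_phi isolates error trajectories
    failures = [traj for group in groups for traj, ok in group if not ok]

    # Phase 3: multiplicative resampling by a factor G' steers the policy
    # toward recovery paths from exactly the queries where it failed
    recovery = [rollout(t["query"]) for t in failures for _ in range(g_prime)]
    # ... a second update on `recovery` would happen here ...
    return len(failures), len(recovery)

fails, retries = fission_grpo_round(["q1", "q2", "q3"])
print(f"{fails} failures expanded into {retries} recovery rollouts")
```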

![As semantic density, measured by [latex]\rho[/latex], increases, neural accuracy declines rapidly, evidenced by a sharp drop in [latex]N_{50}[/latex]. This validates the Orthogonality Constraint: higher densities produce greater key overlap and therefore more interference. Achieving [latex]\rho < 0.3[/latex] proved unattainable with realistic fact structures.](https://arxiv.org/html/2601.15313v1/fig_density_vs_collapse.png)
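
The crowding effect behind this plot is easy to reproduce in miniature. The sketch below is not the paper's experiment ([latex]\rho[/latex] and [latex]N_{50}[/latex] are its own definitions); it only demonstrates the generic geometric fact the Orthogonality Constraint builds on: packing more random unit keys into a fixed-dimensional space drives the worst-case overlap between keys up, and overlapping keys interfere at retrieval time.

```python
# Toy demonstration (not the paper's metric): pairwise overlap of
# random unit keys grows as more keys share a fixed-dimensional space.
import numpy as np

rng = np.random.default_rng(0)
dim = 64  # key dimensionality, chosen arbitrarily

for n_keys in (16, 64, 256, 1024):
    keys = rng.standard_normal((n_keys, dim))
    keys /= np.linalg.norm(keys, axis=1, keepdims=True)  # unit-normalize
    sims = np.abs(keys @ keys.T)       # |cosine similarity| between keys
    np.fill_diagonal(sims, 0.0)        # ignore self-similarity
    print(f"{n_keys:4d} keys: max overlap {sims.max():.3f}, mean {sims.mean():.3f}")
```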

![The model distinguishes itself from traditional Federated Graph Neural Networks by establishing a different communication structure between server and clients, indicated by the sequential exchanges ❶, ❷, and ❸, which fundamentally alters the flow of information during the learning process.](https://arxiv.org/html/2601.15722v1/x1.png)
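
For context on what the figure's ❶/❷/❸ exchange replaces, a traditional federated round follows the broadcast, local update, aggregate pattern sketched below (a FedAvg-style toy in NumPy with invented shapes and a made-up local step; the paper's model departs from exactly this structure).

```python
# Baseline for contrast: the conventional federated round that the
# figure's communication structure departs from. Shapes and the local
# update rule here are invented for illustration.
import numpy as np

rng = np.random.default_rng(1)
n_clients, dim = 4, 8
global_params = np.zeros(dim)
client_targets = [rng.standard_normal(dim) for _ in range(n_clients)]

for round_id in range(3):
    # (1) server -> clients: broadcast the current global parameters
    local = [global_params.copy() for _ in range(n_clients)]
    # (2) clients: one local step toward their private objective
    local = [p + 0.5 * (t - p) for p, t in zip(local, client_targets)]
    # (3) clients -> server: aggregate by simple averaging
    global_params = np.mean(local, axis=0)
    print(f"round {round_id}: mean |param| = {np.abs(global_params).mean():.3f}")
```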