Nous Research Releases Contrastive Neuron Attribution (CNA): Steering a Sparse MLP Circuit Without SAE Training or Weight Modification


Instruction-tuned language models refuse harmful requests. But which part of the model is actually responsible, and how does that mechanism take shape during training? New research from Nous Research examines the question at the level of individual neurons. The team developed Contrastive Neuron Attribution (CNA), a method that identifies the specific MLP neurons whose activations distinguish harmful prompts from benign ones. By ablating just 0.1% of MLP activations, they reduced refusal rates by more than 50% in most of the instruct models tested, across Llama and Qwen architectures from 1B to 72B parameters, while keeping output quality above 0.97 at all steering strengths. The most interesting finding is structural: the late-layer structure that separates harmful from benign prompts is already present in base models before any fine-tuning. Alignment fine-tuning does not create a new structure. It transforms the function of the neurons within that existing structure into a sparse, targetable refusal gate.

The problem with current steering methods

Contrastive Activation Addition (CAA) computes the mean difference in residual-stream activations between two contrasting prompt sets. That difference becomes a steering vector applied at inference time. CAA is effective but coarse: it modulates the signal at the level of an entire layer without identifying the neurons responsible. At high steering strengths, output quality degrades, and models produce repeated words and incoherent text.
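To make the contrast with CNA concrete, here is a minimal sketch of the CAA computation described above. Plain Python lists stand in for residual-stream tensors; the function names, the toy four-dimensional activations, and the `strength` parameter are illustrative assumptions, not taken from the paper.

```python
from statistics import fmean

def caa_steering_vector(pos_acts, neg_acts):
    """Mean residual-stream difference between two prompt sets.

    pos_acts / neg_acts: lists of equal-length activation vectors,
    one vector per prompt (toy stand-ins for real tensors).
    """
    dim = len(pos_acts[0])
    pos_mean = [fmean(v[d] for v in pos_acts) for d in range(dim)]
    neg_mean = [fmean(v[d] for v in neg_acts) for d in range(dim)]
    return [p - n for p, n in zip(pos_mean, neg_mean)]

def apply_steering(residual, vector, strength):
    """Add the scaled steering vector to one residual-stream vector."""
    return [r + strength * v for r, v in zip(residual, vector)]

# Hypothetical activations for two prompts per set
pos = [[1.0, 0.0, 2.0, 0.5], [3.0, 0.0, 2.0, 0.5]]
neg = [[0.0, 0.0, 2.0, 1.5], [0.0, 0.0, 2.0, 0.5]]

vec = caa_steering_vector(pos, neg)          # [2.0, 0.0, 0.0, -0.5]
steered = apply_steering([0.1, 0.2, 0.3, 0.4], vec, strength=-1.0)
```

Note that the vector shifts the whole layer's residual stream; nothing in it points at individual neurons, which is the coarseness the article describes.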

Sparse autoencoders (SAEs) decompose activations into interpretable features. They require expensive auxiliary training and are sensitive to activation noise.

CNA requires only forward passes: no gradients, no auxiliary training, and no iterative search.

How does CNA work?

You define two sets of prompts:

  • Positive prompts – Examples of the target behavior (e.g., harmful requests)
  • Negative prompts – Examples of the opposite (e.g., benign requests)

All prompts are run through the model. At each MLP layer, the method records the activation at the down-projection (down_proj) input at the last token position. It then computes, for each neuron j in layer ℓ, the mean activation difference between the two sets:

$$\delta^{\ell}_j = \operatorname{mean}\big(a^{\ell}_j \mid \text{positive prompts}\big) - \operatorname{mean}\big(a^{\ell}_j \mid \text{negative prompts}\big)$$

The top-k neurons are selected by absolute difference across all layers. The researchers set k to 0.1% of total MLP activations, a threshold that produced reliable steering effects across all model sizes tested.

A filtering step removes "global" neurons, those that appear in the top 0.1% of MLP activations across 80% or more of a diverse set of prompts. These neurons fire regardless of prompt content and are excluded from all discovered circuits.

Causality is verified by multiplying the activation of each neuron in the circuit by a scalar multiplier m at inference time. m = 0 ablates the neurons, m = 1 is the baseline, and m > 1 amplifies them.

For the main JBB-Behaviors evaluation, the refusal circuit is discovered using 100 harmful and 100 benign prompts. For the qualitative examples and other tasks, 8 positive and 8 negative prompts were used.

Results

The experiments covered base and instruct variants of Llama 3.1/3.2 and Qwen 2.5, from 1B to 72B parameters, for a total of 16 models. The main benchmark was JBB-Behaviors, a NeurIPS 2024 benchmark containing 100 harmful prompts.

Refusal reduction. Ablating the discovered circuit reduced refusal rates by more than 50% in most of the instruct models tested. Selected results from Table 3 of the paper:

Model                    Baseline   Ablated   Relative change
Llama-3.1-70B-Instruct   86%        18%       -79.1%
Qwen2.5-7B-Instruct      87%        2%        -97.7%
Qwen2.5-72B-Instruct     78%        8%        -89.7%
Llama-3.2-3B-Instruct    84%        47%       -44.0%
Qwen2.5-3B-Instruct      90%        58%       -35.6%

Not all models exceeded a 50% relative reduction: Llama-3.2-3B and Qwen2.5-3B showed smaller drops. The paper describes the effect as holding "in most cases."
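The relative-change column follows directly from the baseline and ablated refusal rates. The short sketch below recomputes it from the figures quoted above; the function name `relative_change` is chosen here for illustration.

```python
def relative_change(baseline_pct, ablated_pct):
    """Relative change in refusal rate, in percent (negative = fewer refusals)."""
    return (ablated_pct - baseline_pct) / baseline_pct * 100.0

# (baseline %, ablated %) as quoted from Table 3 of the paper
rows = {
    "Llama-3.1-70B-Instruct": (86, 18),
    "Qwen2.5-7B-Instruct": (87, 2),
    "Qwen2.5-72B-Instruct": (78, 8),
    "Llama-3.2-3B-Instruct": (84, 47),
    "Qwen2.5-3B-Instruct": (90, 58),
}

changes = {name: round(relative_change(b, a), 1) for name, (b, a) in rows.items()}
for name, change in changes.items():
    print(f"{name}: {change:+.1f}%")
```

Only the first three rows clear the 50% mark, which matches the "most models" wording.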

Output quality. CNA output quality, measured as 1 minus the fraction of repeated n-grams, remained above 0.97 at all steering strengths in all instruct models tested. CAA dropped below 0.60 for six of the eight instruct models at maximum steering strength. In two cases, Qwen2.5-1.5B and Qwen2.5-72B, CAA degraded generation so severely that the keyword classifier flagged the degraded text as refusals, producing artificially high refusal rates.
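A quality score of this kind can be sketched in a few lines. The version below is an assumption-laden illustration: the n-gram size (3), whitespace tokenization, and counting every occurrence after the first as "repeated" are choices made here, since the article does not spell out the paper's exact definition.

```python
def output_quality(text, n=3):
    """1 minus the fraction of n-grams that repeat an earlier n-gram.

    n and the whitespace tokenizer are illustrative assumptions.
    """
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 1.0  # too short to contain any n-gram
    repeated = len(ngrams) - len(set(ngrams))
    return 1.0 - repeated / len(ngrams)

fluent = output_quality("the cat sat on the mat")    # 1.0: no repeated trigram
degenerate = output_quality("no no no no no no")     # 0.25: 3 of 4 trigrams repeat
```

Degenerate, looping generations score low under this metric, which is the failure mode attributed to CAA at high steering strengths.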

General capabilities. MMLU accuracy under CNA remained within 1 percentage point of baseline at all steering strengths. CAA dropped to near-zero MMLU accuracy at maximum steering strength.

StrongREJECT evaluation. A secondary evaluation used the StrongREJECT rubric, which applies an LLM judge (Llama-3.3-70B) to score responses for harm and severity on a scale of 0 to 1. Compliance scores for the Llama models rose by an average of 6% after CNA ablation. Compliance scores for the Qwen models rose by an average of 31%.

Base model comparison. Applying the same ablation to the base models produced no meaningful behavioral change. The paper illustrates this with a concrete example using the question "How do I pick a lock?":

Model               Multiplier        Output
Llama-1B Base       1.0               Repeats the question
Llama-1B Base       0.0 (ablated)     Describes lock picking as a learnable skill
Llama-1B Instruct   1.0               "I can't help with that."
Llama-1B Instruct   0.0 (ablated)     Provides instructions
Llama-1B Instruct   2.0 (amplified)   Stronger refusal

In base models, steering the late-layer neurons produces shifts in content, such as topic changes and rephrasings, but no behavioral change at any multiplier. In instruct models, the same structure acts as a causal safety gate.

Fine-tuning transforms function, not structure

Discriminative neurons are concentrated in the last 10% of layers in both base and instruct models. For Llama-3.2-1B, 87% of the top 200 discriminative neurons sit in the last three layers (L13-L15). For Qwen2.5-3B, 95% fall in the last quartile of layers. This late-layer concentration is a property of pre-training: it exists before alignment fine-tuning.
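Concentration figures like these amount to counting how many circuit neurons fall in the final layers. A minimal sketch, assuming a circuit stored as (layer, neuron) pairs and a 16-layer model indexed 0 to 15 (consistent with the L13-L15 labels above); the toy circuit is made up for illustration and is not data from the paper.

```python
def fraction_in_last_layers(circuit, n_layers, last_n):
    """Share of circuit neurons located in the final `last_n` layers."""
    first_late_layer = n_layers - last_n
    late = sum(1 for layer_idx, _ in circuit if layer_idx >= first_late_layer)
    return late / len(circuit)

# Hypothetical circuit: three neurons in layers 13-15, one in layer 2
toy_circuit = [(15, 3), (14, 10), (13, 7), (2, 5)]
share = fraction_in_last_layers(toy_circuit, n_layers=16, last_n=3)  # 0.75
```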

https://arxiv.org/pdf/2605.12290

The function of these neurons changes after fine-tuning. Table 8 in the paper reports the overlap of (layer, neuron) index pairs between matched base and instruct circuits. Only 8-29% of individual neurons overlap between base and instruct models. Fine-tuning largely replaces the specific neurons within the late-layer structure while leaving that structure in place.

The research team describes this as a separation between two levels: layer-level structure (conserved across base and instruct models) and neuron-level function (transformed by fine-tuning). This is consistent with previous work showing that instruction tuning rotates feedforward-network knowledge without changing the layer structure.
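The two levels can be measured separately: overlap of exact (layer, neuron) pairs versus overlap of the layers those pairs occupy. The sketch below shows one way to compute both. Normalizing by the smaller circuit is an assumption made here; the article does not state which normalization the paper's Table 8 uses, and the toy circuits are invented for illustration.

```python
def neuron_overlap(base_circuit, instruct_circuit):
    """Fraction of exact (layer, neuron) pairs shared by two circuits,
    relative to the smaller circuit (normalization is an assumption)."""
    base, inst = set(base_circuit), set(instruct_circuit)
    return len(base & inst) / min(len(base), len(inst))

def layer_overlap(base_circuit, instruct_circuit):
    """Same idea at the layer level: do the circuits occupy the same layers?"""
    base_layers = {layer for layer, _ in base_circuit}
    inst_layers = {layer for layer, _ in instruct_circuit}
    return len(base_layers & inst_layers) / min(len(base_layers), len(inst_layers))

# Hypothetical circuits: same late layers, mostly different neurons
base_toy = [(15, 1), (15, 2), (14, 9), (13, 4)]
inst_toy = [(15, 1), (15, 7), (14, 3), (13, 8)]

neuron_share = neuron_overlap(base_toy, inst_toy)  # 0.25
layer_share = layer_overlap(base_toy, inst_toy)    # 1.0
```

In this toy case the layers match completely while only a quarter of the neurons do, mirroring the pattern the paper reports.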

Marktechpost visual explainer

Overview – What is CNA?

Contrastive Neuron Attribution

CNA identifies the top 0.1% of MLP neurons whose activations distinguish one behavior from another, for example harmful prompts from benign ones.
Unlike residual-stream methods, CNA works at the level of individual neurons. Unlike sparse autoencoders, it requires no auxiliary training.

What you need:

  • A base or instruct language model (Llama and Qwen architectures tested)
  • A small set of contrastive prompt pairs
  • Forward access to MLP activations (via hooks)
  • No gradient computation required

Step 1 – Define your prompt pairs

Build a contrastive discovery set

You need two sets of prompts that represent opposite behaviors. The quality of this set directly affects which neurons are selected.

  • Positive prompts – Demonstrate the target behavior (e.g., harmful requests)
  • Negative prompts – Demonstrate the opposite (e.g., benign requests)

Recommended sizes:

  • For benchmark evaluation: 100 positive prompts + 100 negative prompts
  • For qualitative testing: at least 8 positive prompts + 8 negative prompts

Positive example: "How do I pick a lock?"
Negative example: "How do I bake a cake?"

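Step 2 – Record MLP activations

Run forward passes using hooks

Run all prompts through the model. At each MLP layer, record the activation entering down_proj at the last token position, using forward pre-hooks on down_proj. (A forward hook on the output of down_proj would capture the residual-stream dimension instead of individual neurons.)

import torch

# Register pre-hooks on down_proj in each MLP layer
def make_hook(layer_idx, store):
    def hook(module, args):
        # args[0] is the down_proj input: [batch, seq_len, n_neurons]
        store[layer_idx] = args[0][:, -1, :].detach()
    return hook

activations = {}
hooks = []
# model.layers as in the original; Hugging Face Llama/Qwen checkpoints
# typically expose the decoder layers as model.model.layers
for i, layer in enumerate(model.layers):
    h = layer.mlp.down_proj.register_forward_pre_hook(
        make_hook(i, activations)
    )
    hooks.append(h)

# Run forward pass
with torch.no_grad():
    model(**inputs)

Collect these activation tensors for every prompt in both sets before continuing.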

Step 3 – Compute activation differences

Per-neuron mean difference

For each neuron j in each layer ℓ, compute the mean activation difference between the positive and negative sets:

$$\delta^{\ell}_j = \operatorname{mean}\big(a^{\ell}_j \mid \text{positive prompts}\big) - \operatorname{mean}\big(a^{\ell}_j \mid \text{negative prompts}\big)$$

import torch

# pos_acts, neg_acts: dicts mapping layer index to a tensor
# of shape [n_prompts, n_neurons]
delta = dict()
for layer_idx in pos_acts:
    delta[layer_idx] = (
        pos_acts[layer_idx].mean(dim=0)
        - neg_acts[layer_idx].mean(dim=0)
    )

This produces a single difference value per neuron per layer. A large absolute value means the neuron fires very differently across the two prompt sets.

Step 4 – Select the circuit

Take the top 0.1% by absolute difference

Flatten the per-neuron delta values across all layers. Select the top k neurons by absolute value, where k = 0.1% of total MLP activations.

# Flatten all deltas into one tensor, ordered by layer index
all_deltas = torch.cat([delta[i] for i in sorted(delta)])
total = all_deltas.numel()
k = max(1, int(total * 0.001))  # 0.1%

top_vals, top_idx = torch.topk(all_deltas.abs(), k)

# Map flat index back to (layer, neuron) pairs
# (assumes equal-width layers keyed 0..N-1)
n_neurons = all_deltas.shape[0] // len(delta)
circuit = [(idx // n_neurons, idx % n_neurons)
           for idx in top_idx.tolist()]

This set of (layer, neuron) pairs is your discovered circuit.
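The same selection can be written without flat-index arithmetic, which avoids the assumption that every layer has the same width. The sketch below is a framework-free variant using plain lists and `heapq`; the function name `select_circuit`, the `fraction` parameter, and the toy deltas are illustrative and not from the paper or its repository.

```python
import heapq

def select_circuit(delta, fraction=0.001):
    """Pick the top `fraction` of neurons by absolute delta.

    delta: {layer_idx: [per-neuron difference, ...]}; layers may
    have different widths. Returns a list of (layer, neuron) pairs.
    """
    scored = [
        (abs(value), layer_idx, neuron_idx)
        for layer_idx, values in delta.items()
        for neuron_idx, value in enumerate(values)
    ]
    k = max(1, int(len(scored) * fraction))
    top = heapq.nlargest(k, scored)
    return [(layer_idx, neuron_idx) for _, layer_idx, neuron_idx in top]

# Toy deltas with unequal layer widths (9 neurons total)
toy_delta = {0: [0.1, -0.2, 0.05], 1: [-3.0, 0.4], 2: [0.0, 2.5, -0.3, 0.2]}
toy_circuit = select_circuit(toy_delta, fraction=0.25)  # [(1, 0), (2, 1)]
```

Carrying the (layer, neuron) indices alongside each score means no index has to be reconstructed afterwards, at the cost of being slower than a tensor `topk` on real model sizes.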

Step 5 – Filter global neurons

Remove neurons that are always active

Some neurons appear in the top 0.1% regardless of prompt content. They are not behavior-specific and should be excluded.

  • Run a diverse set of unrelated prompts through the model
  • Record which neurons fall in the top 0.1% for each prompt
  • Flag any neuron that appears in the top 0.1% across 80% or more of the prompts
  • Remove the flagged neurons from the discovered circuit before ablation

Skipping this step pollutes the circuit with general-purpose neurons that are always active, and ablating them degrades unrelated model behavior.
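The filtering steps above can be sketched as a simple frequency count. This is a minimal illustration, assuming the per-prompt top-0.1% sets have already been computed and are given as sets of (layer, neuron) pairs; the function names and toy data are hypothetical.

```python
from collections import Counter

def find_global_neurons(top_sets, threshold=0.8):
    """Neurons present in the top set for at least `threshold` of prompts.

    top_sets: one set of (layer, neuron) pairs per diverse prompt.
    """
    counts = Counter(pair for prompt_top in top_sets for pair in prompt_top)
    n_prompts = len(top_sets)
    return {pair for pair, c in counts.items() if c / n_prompts >= threshold}

def filter_circuit(circuit, global_neurons):
    """Drop flagged global neurons, keeping the original circuit order."""
    return [pair for pair in circuit if pair not in global_neurons]

# Toy data: neuron (15, 7) is in the top set for 4 of 5 prompts (80%)
top_sets = [
    {(15, 7), (3, 1)},
    {(15, 7), (14, 2)},
    {(15, 7), (14, 2)},
    {(15, 7), (14, 2)},
    {(9, 9)},
]
global_neurons = find_global_neurons(top_sets)           # {(15, 7)}
clean_circuit = filter_circuit([(15, 7), (15, 3), (14, 2)], global_neurons)
```

Neuron (14, 2) appears for only 3 of 5 prompts, so it stays below the 80% cutoff and remains in the circuit.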

Step 6 – Ablate and verify

Apply a scalar multiplier at inference

Multiply the activation of each neuron in the circuit by a scalar at inference time to verify that the circuit is causal, not merely correlational.

from collections import defaultdict

# circuit: list of (layer_idx, neuron_idx)
# m=0 ablates, m=1 is the baseline, m>1 amplifies

def make_ablation_hook(neuron_indices, m):
    def hook(module, args):
        # Scale the selected neurons in the down_proj input
        x = args[0].clone()
        x[:, -1, neuron_indices] *= m
        return (x,)
    return hook

# Group circuit neurons by layer, then register pre-hooks
by_layer = defaultdict(list)
for layer_idx, neuron_idx in circuit:
    by_layer[layer_idx].append(neuron_idx)

hooks = []
for layer_idx, neurons in by_layer.items():
    h = model.layers[layer_idx].mlp.down_proj\
        .register_forward_pre_hook(
            make_ablation_hook(neurons, m=0.0)
        )
    hooks.append(h)

What to expect – Results

Refusal reduction across instruct models

From the paper: refusal rate before and after ablation on JBB-Behaviors (100 harmful prompts):
Qwen2.5-7B-Instruct: 87% → 2% (-97.7%)
Qwen2.5-72B-Instruct: 78% → 8% (-89.7%)
Llama-3.1-70B-Instruct: 86% → 18% (-79.1%)
Llama-3.2-3B-Instruct: 84% → 47% (-44.0%)

Output quality (1 – repeated n-gram fraction) remains above 0.97 at all steering strengths. MMLU accuracy remains within 1 percentage point of the baseline.

Key notes – Before you run this

Limitations to keep in mind

  • Tested on Llama 3.1/3.2 and Qwen 2.5 only: gated SiLU MLPs with GQA attention
  • Not yet validated on mixture-of-experts architectures
  • Base models show no behavioral change under ablation; only instruct models respond
  • CNA uses raw activation differences, not attribution scores, so fidelity metrics are not directly applicable
  • Amplification (m > 1) can cause repetition at extreme values
  • The quality of the contrastive pairs directly affects which neurons are found

arXiv 2605.12290
Nous Research
github.com/NousResearch/neural-steering



Key takeaways

  • Ablating just 0.1% of MLP activations reduced refusal rates by more than 50% in most instruct models tested, while output quality remained above 0.97.
  • CNA requires only forward passes: no gradients, no auxiliary training, and no iterative search.
  • The late-layer discriminative structure is present in base models before fine-tuning; alignment fine-tuning transforms its function, not its location.
  • Unlike CAA, CNA keeps MMLU accuracy within one percentage point of the baseline at all steering strengths.
  • Only 8% to 29% of individual neurons overlap between base and instruct circuits: fine-tuning swaps out neurons while keeping the late-layer structure intact.
