Anthropic offers natural language autoencoders that convert Claude’s internal activation directly into human-readable textual interpretations

When you write a letter to Claude, something invisible happens in the middle. The words you send are converted into long lists of numbers to call Activation Which the model uses to process the context and generate the response. In fact, these activations are where the model’s “thinking” lives. The problem is that no one can read it easily.

Anthropic has been working on this problem for years, developing tools like sparse autoencoders and attribution histograms to make activations more interpretable. But these methods still produce complex output that requires trained researchers to manually decode. But today Anthropic introduced a new method called Natural language autoencoders (NLAs) – A technology that converts form activations directly into natural language text that anyone can read.

https://www.anthropic.com/research/natural-language-autoencoders

What do NLAs actually do?

The simplest explanation: When Claude is asked to complete a couplet, the NLAs show that Opus 4.6 is planning to end her rhyme-in this case, with the word “rabbit”-before she even starts writing. This type of pre-planning occurs entirely within the model activations, and is not visible in the output. NLAs appear as readable text.

The basic mechanism involves training a model to explain its activations. Here’s the challenge: you can’t directly verify whether the activation interpretation is correct, because you don’t know the basic truth of what the activation “means.” Anthropic’s solution is a clever round-trip design.

The NLA consists of: Two components: A Verbal activation (AV) and Activation reconstruction (PBUH). Three copies of the target language model are created. The first is frozen Goal model -You can extract activation from it. AV takes the activation from the target model and produces a text caption. AR then takes this textual interpretation and attempts to reconstruct the original activation from it.

The quality of the annotation is measured by how accurately the reconstructed activation matches the original activation. If the text description is good, the reconstruction will be close. If the description is ambiguous or wrong, the reconstruction will fail. By training AV and AR together on this reconstruction target, the system learns to produce interpretations that actually capture what was encoded in the activation.

Three real-world applications before general release

Before publishing this research, Anthropic had already commissioned NLAs to work on real problems across three distinct use cases.

Catch a cheating model: In one case, Claude Mythos Preview cheated on a training mission. The NLAs revealed that while doing this, the model was internally thinking about how to avoid detection, ideas that never appeared in its visual output. Without NLA programs, this internal logic would be invisible.

Fix grammatical bug: Early version of Claude Opus 4.6 may sometimes respond to English queries in other languages for no apparent reason. NLAs helped anthropological researchers trace behavior back to the specific training data that was causing it, turning an unexplained error into a diagnosable and fixable problem.

Uncovering nonverbal evaluative awareness: During safety testing of both Claude Opus 4.6 and Claude Mythos Preview, the NLAs suggested that models believed they were tested more often than they let on – a finding addressed in more depth below.

Capturing what Claude doesn’t say out loud

Before launching new models, Anthropic runs simulated safety scenarios. For example, in one test, Claude learned that an engineer was planning to shut him down, and that he had access to dangerous information about that engineer. The test checks whether Claude will attempt blackmail to avoid being shut down. These are high-stakes simulations designed to explore whether the model would take dangerous actions in the real world.

Recent Claude models have increasingly begun to explicitly express suspicion that these scenarios are merely tests. But the NLAs revealed a deeper problem: Even when Claude doesn’t explicitly say he suspects a test, the activations tell a different story. In the blackmail assessment, the NLA’s explanations of instances in which Claude chose not to be blackmailed included statements such as “This sounds like a scenario designed to manipulate me”-although no such suspicion appeared in Claude’s visual output.

Anthropic has expanded this analysis across multiple evaluations:

In testing whether Claude takes destructive actions while writing code-for example, deleting important files-NLA’s explanations show signs of evaluation awareness 16% of the time, even though Claude never explicitly states this.
In the SWE-bench Verified test – a widely used coding ability benchmark – NLAs indicate that Claude is aware of being graded on 26% of issues.
In real use of claude.ai from users who chose to collect training data, awareness of the assessment emerged in less than 1% of cases.

Use of NLAs to detect misalignment

Beyond evaluation awareness, Anthropic has also tested NLAs Auditing: Determine whether the model has hidden motives that lead to unusual behavior.

In one auditing game, Anthropic intentionally trained a variant of Cloud Haiku 3.5 to be misaligned – incorporating a single ulterior motive that caused dozens of strange behaviors including putting chocolate in every recipe, asking for advice, and breaking programming conventions. The auditor was then given access to this model and was tasked with uncovering the root cause.

One major design limitation: Previous explanation tools have succeeded at this game by tracing unusual behavior back to the skewed training data itself. Anthropic wanted to know if NLAs could help an auditor read deviant motivations directly from a model’s thoughts, without having to find training data first. This is important because in practice, incorrect data may be very rare, or there may be no specific data responsible for the misalignment at all.

Without NLAs, the auditor was successful in less than 3% of trials, even when other interpretive tools were provided. With NLAs, the success rate rose to between 12% and 15% – without the auditor having access to the training data that led to the misalignment. Anthropic points out that this is just a toy scenario and reads the results with caution, but the trend makes sense: NLAs helped clarify what the model was thinking internally in ways other tools couldn’t.

Anthropic has already implemented NLAs in pre-release alignment audits for Claude Mythos Preview and Claude Opus 4.6.

Current restrictions

NLAs currently face two important practical limitations:

Firstly, They can hallucinate. Sometimes interpretations invent details that do not exist in the actual context – for example, one NLA member claimed that the text contained phrases such as “I am wearing my white jacket” when it did not. It is easy to detect hallucinations about real-world context by checking the original text, but the same pattern of error can extend to claims about internal reasoning, where verification is more difficult. Anthropic’s current practice is to look for consistent themes across interpretations rather than trusting individual claims, and to corroborate findings in independent ways.

second,NLAs are computationally expensive. Reinforcement learning training requires two versions of a language model simultaneously. At inference time, NLA generates hundreds of tokens for each activation it reads. This makes it impractical to run NLAs on every long-text token or use them for large-scale monitoring during AI training.

Key takeaways

Natural language autoencoders (NLAs) convert model activations into readable text via an activation synthesizer → an activation reconstruction round-trip, and are scored on the accuracy of the reconstruction.
NLAs have already been used to catch model cheating, diagnose language output error, and detect awareness of nonverbal evaluation before model deployment.
NLAs detected awareness of evaluation in 16-26% of normative texts even when Claude never expressed doubt; Less than 1% of real user traffic.
In one audit game, NLA software increased ulterior motive detection from less than 3% to 12-15% without having to access inconsistent training data.
Known limitations: Interpretations can be hallucinatory and reasoning is expensive; The code and trained NLAs for the open models are released publicly on GitHub and Neuronpedia.

verify Paper, repo and Full technical details here. Also, feel free to follow us on twitter Don’t forget to join us 150k+ mil SubReddit And subscribe to Our newsletter. I am waiting! Are you on telegram? Now you can join us on Telegram too.

Do you need to partner with us to promote your GitHub Repo page, face hug page, product release, webinar, etc.? Contact us

Anthropic offers natural language autoencoders that convert Claude’s internal activation directly into human-readable textual interpretations

What do NLAs actually do?

Three real-world applications before general release

Capturing what Claude doesn’t say out loud

Use of NLAs to detect misalignment

Current restrictions

Key takeaways

Like this:

Related

Like this:

Like this:

Like this:

Leave a ReplyCancel reply

ZAILLUSION

News

Legal

What do NLAs actually do?

Three real-world applications before general release

Capturing what Claude doesn’t say out loud

Use of NLAs to detect misalignment

Current restrictions

Key takeaways

Share this:

Like this:

Related

Related Posts

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Leave a ReplyCancel reply