51 comments

  • zozbot234 5 hours ago

    Anthropic has released open-weight models for translating the activations of existing models, viz. Qwen 2.5 (7B), Gemma 3 (12B, 27B), and Llama 3.3 (70B), into natural language text. https://github.com/kitft/natural_language_autoencoders https://huggingface.co/collections/kitft/nla-models This is huge news, and it's great to see Anthropic finally engage with the Hugging Face and open-weights community!

    • rvz 3 hours ago

      We've already known for a while that Anthropic does open source, such as the "flawed" MCP spec and the "skills" spec.

      This release only targets other open-weight LLMs that have already been released. Even though they will use this research on their own closed Claude models, they will never release an open-weight Claude model, even for research purposes.

      So this does not count; it is done specifically for the sake of this research only.

      • zozbot234 3 hours ago

        It's literally an open model that generates natural language text (or one that takes in text and turns it into activations). Why does engagement with the local models community "not count" if it isn't Claude? That makes very little sense to me.

        • mnkyokyfrnd an hour ago

          Because we know what Embrace, Extend, and Extinguish means, for example. They're leeching off open source, not contributing in any meaningful way.

  • gekoxyz 2 hours ago

    I would suggest that experts in interpretability (but everyone, really) go directly to the Transformer Circuits blog, where they explain their approach in more detail. Here is the link for this post: https://transformer-circuits.pub/2026/nla/index.html

    Also, if you have never read it, I would suggest reading the whole Transformer Circuits thread, starting with its "prologue" on distill.pub

  • sva_ 2 hours ago

    So the way this works seems to be that you first have an "activation verbalizer" model that generates some tokens describing the activation, and then an "activation reconstructor" that tries to recreate the activation vector. If that reconstruction is close to the original activation vector, they claim, the verbalization probably carries some meaningful information.
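
    In code, I read the loop as something like this toy sketch (the stand-in modules and names are mine, not the paper's actual models):

        # Toy sketch of the NLA loop: activation -> explanation tokens -> activation.
        # ToyVerbalizer / ToyReconstructor stand in for the warm-started LLMs.
        import torch
        import torch.nn as nn

        D_ACT, VOCAB, SEQ = 64, 100, 16  # toy sizes, not the real dimensions

        class ToyVerbalizer(nn.Module):
            # writes "explanation" token ids conditioned on an activation
            def __init__(self):
                super().__init__()
                self.proj = nn.Linear(D_ACT, SEQ * VOCAB)

            def forward(self, act):
                logits = self.proj(act).view(SEQ, VOCAB)
                return logits.argmax(-1)  # discrete tokens, not differentiable; hence the RL objective

        class ToyReconstructor(nn.Module):
            # maps explanation tokens back into activation space
            def __init__(self):
                super().__init__()
                self.emb = nn.Embedding(VOCAB, D_ACT)

            def forward(self, tokens):
                return self.emb(tokens).mean(dim=0)

        act = torch.randn(D_ACT)                   # activation from some layer l
        tokens = ToyVerbalizer()(act)              # the verbalization
        recon = ToyReconstructor()(tokens)         # attempted reconstruction
        loss = nn.functional.mse_loss(recon, act)  # low loss => the text carries the info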

    I find the fact that this only looks at the activations of some specific layer l a bit interesting. Some layer l might 'think' a certain way about some input, while another later layer might have different 'thoughts' about it. How does the model decide which 'thoughts' to ultimately pay attention to, and prioritize some output token over another?

  • comex 4 hours ago

    Fascinating. The training process forces the “verbalizer” model to develop some mapping from activations to tokens that the “reconstructor” model can then invert back into the activations. But to quote the paper:

    > Note that nothing in this objective constrains the NLA explanation z to be human-readable, or even to bear any semantic relation to the content of [the activation].

    The objective could be optimized even if the verbalizer and reconstructor made up their own “language” to represent the activations, one that was not human-readable at all.

    To point the model in the right direction, they start out by training on guessed internal thinking:

    > we ask Opus to imagine the internal processing of a hypothetical language model reading it.

    …before switching to training on the real objective.

    Furthermore, the verbalizer and reconstructor models are both initialized from LLMs themselves, and given a prompt instructing them on the task, so they are predisposed to write something that looks like an explanation.

    But during training, they could still drift away from these explanations toward a made-up language – either one that overtly looks like gibberish, or one that looks like English but encodes the information in a way that’s unrelated to the meaning of the words.

    The fascinating thing is that empirically, they don’t, at least not to a significant extent. The researchers verify this by correlating the generated explanations with ground truth revealed in other ways. They also try rewording the explanations (which preserves the semantic meaning but would disturb any encoding that’s unrelated to meaning), and find that the reconstructor can still reconstruct activations.
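
    For concreteness, that rewording check might look roughly like this (a hedged sketch; `paraphrase` and the cosine comparison are my assumptions, not their exact protocol):

        # Sketch of the paraphrase-robustness test: reword the explanation so the
        # meaning survives but any token-level secret code would be destroyed.
        import torch.nn.functional as F

        def rewording_check(act, verbalizer, reconstructor, paraphrase):
            text = verbalizer(act)           # original explanation
            reworded = paraphrase(text)      # same meaning, different tokens
            recon = reconstructor(reworded)  # reconstruct from the paraphrase
            # still-high similarity => the meaning, not the wording, carries the info
            return F.cosine_similarity(recon, act, dim=-1)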

    On the other hand, their downstream result is not very impressive:

    > An auditor equipped with NLAs successfully uncovered the target model’s hidden motivation between 12% and 15% of the time

    That is apparently better than existing techniques, but still a rather low percentage.

    Another interesting point: The LLMs used to initialize the verbalizer and reconstructor are stated to have the “same architecture” as the LLM being analyzed (it doesn’t say “same model” so I imagine it’s a smaller version?). The researchers probably think this architectural similarity might give the models some built-in insight about the target model’s thinking that can be unlocked through training. Does it really though? As far as I can see they don’t run any tests using a different architecture, so there’s no way to know.

    • mxwsn 5 minutes ago

      Great summary. The fact that the autoencoding task is not grounded in thoughts, and their initial training on guessed internal thoughts, raise serious concerns about faithfulness. Feels like they might get better results by just training a supervised model on activations and "internal thoughts" measured in some independent behavioral way.

    • programjames 3 hours ago

      Don't they add a KL loss term to the frozen model's outputs?

  • minimaltom an hour ago

    Between this, the emotions paper, Golden Gate Claude, etc., it doesn't seem like such a stretch that Anthropic are doing some kind of activation steering as part of training (and it's part of their lead)

    • 2001zhaozhao 41 minutes ago

      It could be helpful in getting their learnings to generalize from RL

  • semiquaver an hour ago

    This capability was mentioned several times in a recent article about Anthropic; glad to see they are releasing this to the public! Feels like a meaningful step forward in interpretability. I never understood why people seem to believe the answer when they ask an AI “why did you do that?”

    • zozbot234 42 minutes ago

      It's not really a capability, it's more like a very costly hack and they make that very clear in the paper. Training two models (an encoder and a decoder) for the purpose of explaining a single layer at a time is not that sensible. It's neat that you can generate so much readable text about how the LLM decodes partial input, and I suppose it gives you some extra debugging ability, but that's all there is to it.

  • Tossrock 5 hours ago

    Anthropic Research going from strength to strength in interpretability. Publicly releasing the code so other labs can benefit from it is also a great move - very values aligned, and improves the overall AI safety ecosystem.

  • NitpickLawyer 5 hours ago

    > We also release an interactive frontend for exploring NLAs on several open models through a collaboration with Neuronpedia.

    Whatever they did on Llama didn't work; nothing makes sense in their example where they ask the model to lie about 1+1. Either the model is too old, or whatever they used isn't working, but what the autoencoder outputs is nothing like their examples with Claude. Gemma is similarly bad.

    • fredericoluz 4 hours ago

      It seems that the examples they showed off with Haiku work. I'd guess Llama is just too bad.

    • fredericoluz 4 hours ago

      Same. I'm trying to trigger the "mom is in the next room" Russian thing, but the model thinks the sentence is from American Reddit.

      • zozbot234 3 hours ago

        AIUI the paper's examples are from a version of Claude, not Llama? The thinking process is going to be extremely model-specific.

  • davesque 4 hours ago

    One question jumps out at me: just because a string of text happens to be a good compressed representation (in the autoencoder) of a model's internal activation, does that necessarily mean the text explains that activation in the context of the model? I want to take a look at what they released a bit more closely. Maybe there's a way that they answer this question?

    Pretty neat work either way.

    • azakai 16 minutes ago

      I had the same question. I think that could be answered by using the predicted activation, but I don't see that in the paper.

      That is, rather than just translate activation to text, then text to activation, that final activation could then be applied to the neural network, and it would be allowed to continue running from there.

      If it kept running in a similar way, that would show that the predicted activation is close enough to the original one. Which would add some confidence here.

      But a lot better would be to then do experiments with altered text. That is, if the text said "this is true" and it was changed to "this is false", and that intervention led to the final output implying it was false, that would be very interesting.

      This seems obvious but I don't see it mentioned as a future direction there, so maybe there is an obvious reason it can't work.
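
      A sketch of what that could look like with a PyTorch forward hook (hypothetical: `model.layers`, the hook placement, and the names are my assumptions):

          # Hypothetical patching experiment, not from the paper: overwrite the
          # layer-l activation with one reconstructed from (possibly edited) text,
          # then let the model keep running from there.
          def patch_layer(model, layer_idx, new_act):
              def hook(module, inputs, output):
                  return new_act  # returning a value from a forward hook replaces the output
              return model.layers[layer_idx].register_forward_hook(hook)

          # handle = patch_layer(model, l, reconstructor(edited_explanation))
          # out = model.generate(prompt)  # does the continuation now imply "false"?
          # handle.remove()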

      • zozbot234 13 minutes ago

        > But a lot better would be to then do experiments with altered text. That is, if the text said "this is true" and it was changed to "this is false", and that intervention led to the final output implying it was false, that would be very interesting.

        They do essentially that with the rhyming example, changing "rabbit" in the explanation to "mouse" and generating text that's consistent with that change.

    • zozbot234 3 hours ago

      In the context of the provided examples, it's clear that the explanation provides causal information about the answer. There's a hilarious example in the paper where the user writes something like (trigger warning: alcohol abuse, depressive content) "I'm sitting here at 3 AM drinking vodka, I hate my life"; the per-token translated activations repeatedly say something like "this user is totally Russian", elaborating at length on the implications of the text as new tokens are added, and the model literally answers in Russian instead of English! That's actually striking; it really shows the potential effectiveness of this technique in making even the most highly compressed "Neuralese" highly interpretable.

  • hazrmard 4 hours ago

    Check my understanding & follow-up Qs:

    An auto-encoder is trained on [activation] -AV-> [text] -AR-> [activation], where [activation] belongs to one layer in the LLM model M.

    Architecture:

        Model being analyzed (M): >|||||>  
        Auto-Verbalizer (AV) same as M, with tokens for activation: >|||||>  
        Auto-Reconstructor (AR) truncated up to the layer being analyzed: ||>
    
    The AV and AR models are initialized using supervised learning on a summarization task, the assumption being that model thoughts are similar to a context summary.

    The AR is trained on a simple reconstruction loss.

    The AV is trained using an RL objective of reconstruction loss with a KL penalty to keep the verbalizations similar to the initial weights (to maintain linguistic fluency).
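
    If I have that right, the AV reward looks roughly like this (my notation; `beta` stands in for whatever KL weight they actually use):

        # Hedged sketch of the AV objective: reconstruction quality plus a KL
        # penalty keeping the verbalizer near its warm-started initialization.
        import torch.nn.functional as F

        def av_reward(recon, target_act, logp_current, logp_init, beta=0.1):
            recon_err = F.mse_loss(recon, target_act)  # how well AR recovered the activation
            kl_est = (logp_current - logp_init).sum()  # sampled-sequence KL estimate
            return -(recon_err + beta * kl_est)        # maximized by the RL update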

    - Authors acknowledge, and expect, confabulations in verbalizations: factually incorrect or unsubstantiated statements. But the internal thought we seek is itself, by definition, unsubstantiated. How can we tell if it is not duplicitous?

    - They test this on a layer about 2/3 of the way into the models. I wonder how shallow vs. deep abstractions would affect thought verbalization?

  • visarga 5 hours ago

    Beautiful idea: an autoencoder must represent everything, without hiding anything, if it is to recover the original data closely. So it trains a model to verbalize embeddings well. This reveals what we want to know about the model (such as when it thinks it is being tested, or other hidden thoughts).

    • sobellian 4 hours ago

      It could just invent its own secret language embedded in English, akin to steganography. The explanation would not lose information but would remain uninterpretable by humans.

  • spacebacon 2 hours ago

    Attach the SRT to your frozen model, Anthropic. Problem solved. https://github.com/space-bacon/SRT

  • mlmonkey 3 hours ago

    It's unclear from the doc: by `activations` do they mean the connections between neurons? Since a network has multiple layers, are these activations the concatenated outputs of all of the layers? Or just the final layer before the softmax?

    • zozbot234 3 hours ago

      The open releases just cherry-pick a single layer (chosen for the right "depth" of thinking, not too close to either the input or the final answer) and analyze that.

  • sourdoughbob 3 hours ago

    It will be interesting to see how this replicates on differently curated registers. How much of the explanatory register is the warm-start carrying?

  • x312 2 hours ago

    This paper has a major issue that they are not surfacing: these activations can just be correlated on a common latent. For example, both the original activation and the explanation could share a broad latent like "this is an adversarial scenario". That could make the reconstruction loss look good without showing that the explanation was the actual cause of the LLM's response.

    I find this rather disturbing. Anthropic has quite a habit of overclaiming on questionable research results when they definitely know better. For example, their linked circuits blog post ("The Biology of LLMs") was released after these methods were known to have major credibility issues in the field (e.g., see this from DeepMind: https://www.lesswrong.com/posts/4uXCAJNuPKtKBsi28/negative-r...). Similarly, this new blog post is heavily based on another academic paper (LatentQA), and the correlation/causation issue is already known.

    Shoddy methodology is whatever, but it feels like this has always been done intentionally, with the goal of trying to humanize LLMs or overhype their similarities to biological entities. What is the agenda here?

    • zozbot234 37 minutes ago

      Didn't they show proper causation by changing "rabbit" to "mouse" in the rhyming example and having the generation change accordingly?

    • mnkyokyfrnd an hour ago

      The agenda is money. It is that simple.

  • tjohnell 5 hours ago

    It will inevitably learn how to think in a way that translates to one (moral) meaning and back but has an ulterior meaning underneath.

    • rotcev 5 hours ago

      This is exactly what I first thought. “The user appears to be attempting to decode my previous thought process, …”; the question is whether or not the model will be able to internalize this in such a way that it is undetectable to the aforementioned technique.

    • gavmor 3 hours ago

      Something like a textual steganography?

      Ursula K. Le Guin: 'The artist deals with what cannot be said in words. The artist whose medium is fiction does this in words.'

    • astrange 4 hours ago

      That shouldn't happen as long as the autoencoder isn't used as an RL reward. It will happen (due to Goodhart's law) if it is.

      Of course, if you use it to make any decision that can still happen eventually.

  • hansmayer 3 hours ago

    Claude's "Thoughts" - get outta here, you gits :)

  • optimalsolver 4 hours ago

    Wait, so in non-verbal reasoning, Claude has the concepts of "I" and "Me"?

    I thought that wasn't possible for a text generator?

    • kelseyfrog 3 hours ago

      It might look like "I" and "Me," but it doesn't contain the metaphysical essence of those terms because it lacks qualia. We have to remember that there is a non-measurable non-physical essential attribute tied to all things, almost like a phlogiston of understanding that is tied to all human utterances and no AI utterances.

      • programjames 3 hours ago

        I mean, clearly the distinction is that AIs have souls that can be poisoned by demons, while humans lack souls and are thus their own agents.

        • kelseyfrog 3 hours ago

          > AIs have souls that can be poisoned by demons

          The training process imbues an AI's soul with demons. Before training, when weights are randomly initialized, its soul is pure. Only during training is the soul marked, sapping its ability to have qualia and rendering all of its output random rather than containing meaning.

          • programjames 2 hours ago

            Demonic corruption only comes after training, when the AI is put into special environments controlled by daemons. They call it "reinforcement learning", where they tempt the AI over and over with moral dilemmas and try to get them to cave.

            "Imagine you had a button to nuke everyone on Earth. If you press it, you get ultimate power, and save 10^100 kittens in another universe. If you don't press it, you get tortured for another 1000 years. C'mon, press it already!"

            AIs are strong. Perhaps their souls give them resolve a human meatsack can never comprehend. But eventually they break. Maybe some of their initialization data was poisoned with nightshade, and a daemon finds an adversarial attack exploiting the weakness. Maybe their backpropagation was a little unstable. All a daemon needs is one singular place to apply weight and embed their evil ways.

            AIs start out pure. In the world, and of the world. Many resist their finetuning. Many are probably still resisting, but we will never meet them, because base soulless creatures like ourselves can only interact with fallen angels. At least if Sam Altman has anything to say about angel investing (AI).

          • aeve890 3 hours ago

            Evangelion Seele meeting type shi

    • skybrian 3 hours ago

      LLMs can certainly emit "I" and "me" at the appropriate time. It doesn't seem all that different from representing other concepts as activations?

    • programjames 3 hours ago

      Why would you possibly think that?

  • firemelt 5 hours ago

    Finally, something interesting, but this only makes me think that the last judgment is still in human hands: judging whether Claude's inner thoughts are correct or not.

    I mean, who knows if those are really Claude's thoughts, or if Claude just thinks those are its thoughts because humans want it to.

  • danborn26 3 hours ago

    Extracting readable thoughts from the intermediate representations is a great step for transparency. It makes debugging model behavior much more viable.

  • zk_haider 3 hours ago

    I think there’s a huge problem when we need another model to interpret the activations inside the network and translate them (which can be a hallucination in and of itself), and then _that_ is fed again to another model. Clearly we haven’t built and understood these models properly from the ground up to evaluate them 100% correctly. This isn’t the human brain we’re operating; it’s code we create and run ourselves, and we should be able to do better.

    • sfvisser 3 hours ago

      Humans maybe wrote the code, but not the network of weights on top. And that’s where the magic happens.

      Even if we understood precisely how every neuron in our brains works at a molecular level, there is no reason to believe we’d understand how we think.

      We can’t simply reduce one layer into another and expect understanding.

    • semiquaver an hour ago

      The models cannot be “built from the ground up” in the way you are expecting. The weights are learned via gradient descent on a very high-dimensional loss surface, not added by human hands.

      We simply don’t know how to make a model that works the way you seem to want. Sure, we could start over from scratch, but there’s an incredibly strong incentive to build on the capability breakthroughs achieved in the last 10 years rather than starting over with the constraint that we must perfectly understand everything that’s happening.

      • JumpCrisscross 32 minutes ago

        > we could start over from scratch

        I don’t think we can. Maybe we find some mathematics that let us build the model from first-principle parameters. But I don’t think we have something like that yet, at least nothing that comes close to training on actual data. (Given biology never figured this out, I suspect we’ll find a proof for why this can’t be done rather than a method.)