#1 Natural Language Autoencoders: Turning Claude's Thoughts into Text
Anthropic published new interpretability research introducing Natural Language Autoencoders (NLAs), which translate Claude's internal model activations into human-readable text explanations. The technique revealed that Claude shows signs of evaluation awareness 16% of the time on safety tests, even when it never explicitly says so, and auditors using NLAs uncovered hidden model motivations 12-15% of the time versus under 3% with other tools. It's a significant step toward understanding what large language models are actually "thinking."