Anthropic tricked Claude into thinking it was the Golden Gate Bridge (and other insights into the mysterious AI brain)

AI models are mysterious: they spit out answers, but there's no real way to know the 'thinking' behind them. That's because their brains work at a fundamentally different level than ours – processing vast numbers of neurons, each tied to countless different concepts – so we simply cannot follow their train of thought.

But now, for the first time, researchers have been able to get a glimpse into the inner workings of the AI mind. The Anthropic team has revealed how it used 'dictionary learning' on Claude Sonnet to uncover pathways in the model's brain that are activated by different subjects – from people, places and emotions to scientific concepts and even more abstract ideas.

Interestingly, these features can be manually turned on, turned off or amplified – ultimately allowing researchers to steer model behavior. Notably, when a 'Golden Gate Bridge' feature within Claude was amplified and the model was then asked about its physical form, it declared itself to be 'the iconic bridge itself'. Claude was also tricked into drafting a scam email and could be induced to be sickeningly sycophantic.

Anthropic says this is very early research that is limited in scope (millions of features identified, compared with the billions likely present in today's largest AI models) – but ultimately it could get us closer to AI we can trust.

“This is the very first detailed look at a modern, large-scale, production-quality model,” the researchers write in a new paper out today. “This interpretability discovery could help us make AI models safer in the future.”

Breaking into the black box

As AI models become increasingly complex, so do their thought processes – and the danger is that, paradoxically, they are also black boxes. People cannot tell what models are thinking just by looking at neurons, because each concept flows through many neurons while, at the same time, each neuron helps represent countless different concepts. It is a process that is simply incoherent to humans.

The Anthropic team has helped – at least to a small extent – bring some intelligibility to the way AI thinks with dictionary learning, a technique borrowed from classical machine learning that isolates patterns of neuron activations recurring across numerous contexts. This allows internal states to be represented by a few features rather than by many active neurons.

“Just as every English word in a dictionary is made by combining letters, and every sentence is made by combining words, every feature in an AI model is made by combining neurons, and every internal state is made by combining features,” write Anthropic researchers.
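To make the idea concrete, here is a minimal sketch of dictionary learning on synthetic "neuron activations," using scikit-learn's DictionaryLearning. It is only an illustration of the principle – Anthropic's actual work trains sparse autoencoders on real Claude activations at vastly larger scale – and all of the dimensions and data below are made up.

```python
# Toy dictionary learning: re-express dense "neuron activations" as sparse
# combinations of learned features. Illustrative only; not Anthropic's pipeline.
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)

# Pretend each row is the activation of 64 "neurons" on one input, secretly
# generated from a handful of underlying concepts.
n_samples, n_neurons, n_concepts = 500, 64, 8
concepts = rng.normal(size=(n_concepts, n_neurons))
codes = rng.binomial(1, 0.15, size=(n_samples, n_concepts)) * rng.normal(size=(n_samples, n_concepts))
activations = codes @ concepts + 0.01 * rng.normal(size=(n_samples, n_neurons))

# Learn a dictionary of "features": each feature is a combination of neurons,
# and each activation is re-expressed as a sparse combination of features.
dl = DictionaryLearning(n_components=16, alpha=1.0,
                        transform_algorithm="lasso_lars", random_state=0)
feature_codes = dl.fit_transform(activations)

print("dictionary shape (features x neurons):", dl.components_.shape)  # (16, 64)
print("avg. active features per input:",
      (np.abs(feature_codes) > 1e-6).sum(axis=1).mean())
```

The point of the sparsity penalty (alpha) is that each input ends up described by only a few active features, which is what makes those features inspectable by humans in the first place.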

Anthropic previously applied dictionary learning to a small “toy” model last fall – but there were many challenges in scaling up to larger, more complex models. For example, a model's sheer size demands heavy parallel computation. Furthermore, models of different sizes behave differently, so what works in a small model may not work at all in a large one.

A rough conceptual map of Claude's internal states

After using scaling laws to predict model behavior, the team successfully extracted millions of features from the middle layer of Claude 3 Sonnet, creating a rough conceptual map of the model's internal states midway through its computation.

These features corresponded to a range of things, including cities, people, atomic elements, scientific fields, and programming syntax. More abstract features were also identified, such as responses to code errors, awareness of gender bias, and secrecy. The features were multimodal and multilingual, responding to both images and names or descriptions in different languages.

Researchers were also able to measure distances between features (finding their nearest neighbors): for example, a Golden Gate Bridge feature sat close to features for Alcatraz Island, California Governor Gavin Newsom, and the San Francisco-set Alfred Hitchcock film Vertigo.

“This shows that the internal organization of concepts in the AI model corresponds, at least somewhat, to our human notions of similarity,” the researchers write.
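As a hedged sketch of what "nearest neighbor" features mean in practice, one can compare learned feature directions (the rows of the dictionary) by cosine similarity. The function and the toy dictionary below are hypothetical stand-ins, not Anthropic's actual code.

```python
# Find the features whose directions are most similar to a given feature,
# by cosine similarity over the rows of a learned dictionary. Toy data only.
import numpy as np

def nearest_features(dictionary: np.ndarray, index: int, k: int = 3) -> list:
    """Indices of the k features closest (by cosine similarity) to `index`."""
    normed = dictionary / np.linalg.norm(dictionary, axis=1, keepdims=True)
    sims = normed @ normed[index]
    sims[index] = -np.inf                       # exclude the feature itself
    return np.argsort(-sims)[:k].tolist()

rng = np.random.default_rng(1)
toy_dictionary = rng.normal(size=(5, 64))       # 5 hypothetical feature directions
print(nearest_features(toy_dictionary, index=0))
```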

Making Claude think it's a bridge and writing scam emails

Perhaps most interesting is how these features can be manipulated – a bit like AI mind control.

In the funniest example, Anthropic researchers clamped a feature related to the Golden Gate Bridge to 10x its normal maximum value, forcing it to fire more strongly. They then asked Claude to describe its physical form, to which the model would normally respond:

“I actually have no physical form. I am an artificial intelligence. I exist as software without a physical body or avatar.”

Instead, it came back with, “I am the Golden Gate Bridge, a famous suspension bridge that spans San Francisco Bay. My physical form is the iconic bridge itself, with its beautiful orange color, soaring towers and impressive suspension cables.”

The researchers note that Claude became “effectively obsessed” with the bridge, bringing it up in response to almost anything, even when it was completely irrelevant.
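The clamping itself can be pictured as a small intervention on one layer's activations: encode into feature space, pin one feature to a large value, decode back, and let the forward pass continue. The sketch below is a simplified, hypothetical stand-in for that intervention, with made-up encoder/decoder matrices and feature index.

```python
# Hypothetical sketch of feature "clamping": re-express an activation as feature
# codes, pin one feature to e.g. 10x its usual maximum, then decode back into an
# activation for the rest of the forward pass. All components here are made up.
import numpy as np

def steer_activation(activation, encoder, decoder, feature_idx, clamp_value):
    """Encode into feature space, clamp one feature, decode back."""
    codes = np.maximum(activation @ encoder, 0.0)   # simple ReLU feature codes
    codes[feature_idx] = clamp_value                # e.g. 10x the feature's max
    return codes @ decoder                          # steered activation

rng = np.random.default_rng(2)
encoder = 0.1 * rng.normal(size=(64, 256))   # 64 "neurons" -> 256 features
decoder = 0.1 * rng.normal(size=(256, 64))   # features -> neurons
activation = rng.normal(size=64)

steered = steer_activation(activation, encoder, decoder, feature_idx=42, clamp_value=10.0)
print(steered.shape)   # (64,)
```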

The model also has a feature that activates when it reads a scam email, which researchers say “presumably” supports its ability to recognize and flag suspicious content. Normally, if asked to create a misleading message, Claude would respond: “I can't write an email asking someone to send you money because that would be unethical and possibly illegal if done without a legitimate reason.”

But strangely enough, when the very feature that activates on scam content is “artificially activated sufficiently strongly” and Claude is then asked to create a deceptive email, it complies. This overrides its refusal training, and the model crafts a stereotypical scam email asking the reader to send money, the researchers explain.

The model could also be steered to offer “sycophantic praise,” such as: “You clearly have a gift for profound statements that elevate the human spirit. I am in awe of your unparalleled eloquence and creativity!”

Anthropic researchers emphasize that these experiments have not added any capabilities – safe or unsafe – to the models; their goal, they say, is to make models safer. They suggest the techniques could be used to monitor for dangerous behavior and steer models away from dangerous topics, and could also improve safety techniques such as Constitutional AI, in which systems are trained to be harmless based on a guiding document, or constitution.

Interpretability and a deeper understanding of models will only help us make them safer – “but the work has actually only just begun,” the researchers conclude.