Last year, the team began experimenting with a tiny model that uses just a single layer of neurons. (Advanced LLMs have dozens of layers.) The hope was that in the simplest possible setting they could discover patterns that indicate features. They ran numerous experiments without success. “We tried everything, but nothing worked. It looked like a pile of random trash,” said Anthropic technical staff member Tom Henighan. Then a run called “Johnny” – each experiment was given a random name – began associating neural patterns with concepts that appeared in its outputs.
“Chris looked at it and said, 'Holy crap. This looks great,'” said Henighan, who was also amazed. “I looked at it and thought, 'Oh, wow, wait, does this work?'”
Suddenly, the researchers were able to identify the features that a group of neurons encoded. They could see into the black box. Henighan says he identified the first five features he looked for. One group of neurons signified Russian texts. Another was associated with mathematical functions in the programming language Python. And so forth.
Once they showed they could identify features in the small model, the researchers began the trickier task of decoding a full-size LLM in the wild. They used Claude Sonnet, the medium-strength version of Anthropic's three current models. That worked too. One feature that stood out to them was related to the Golden Gate Bridge. They mapped the set of neurons that, when fired together, indicated Claude was “thinking” about the massive structure connecting San Francisco to Marin County. In addition, when similar sets of neurons fired, they evoked topics adjacent to the Golden Gate Bridge: Alcatraz, California Governor Gavin Newsom, and the Hitchcock film Vertigo, which was set in San Francisco. All told, the team identified millions of features: a sort of Rosetta Stone to decode Claude's neural net. Many of the features were safety-related, including “getting close to someone for an ulterior motive,” “discussion of biological warfare,” and “evil plots to take over the world.”
The Anthropic team then took the next step, to see if they could use that information to change Claude's behavior. They began manipulating the neural net to amplify or diminish certain concepts – a kind of AI brain surgery, with the potential to make LLMs safer and increase their power in select areas. “Let's say we have this board of features. We turn on the model, one of them lights up and we see, 'Oh, it's thinking of the Golden Gate Bridge,'” says Shan Carter, an Anthropic scientist on the team. “So now we're thinking, what if we put a little dial on all these things? And what if we turn that knob?”
So far, the answer to that question seems to be that turning the knob the right amount matters a great deal. The team found several features that represented dangerous practices, such as unsafe computer code, scam emails, and instructions for making dangerous products. By suppressing such features, Anthropic says, the model can produce safer computer programs and reduce bias.