DeepMind makes big jump toward interpreting LLMs with sparse autoencoders

Large language models (LLMs) have made remarkable progress in recent years. But understanding how they work remains a challenge and scientists at artificial intelligence labs are trying to peer into the black box.

One promising approach is the sparse autoencoder (SAE), a deep learning architecture that breaks down the complex activations of a neural network into smaller, understandable components that can be associated with human-readable concepts.

Harnessing the Power of Generative AI: How AI Is Changing Work and Beyond

In a new paper, researchers at Google DeepMind introduce JumpReLU SAE, a new architecture that improves the performance and interpretability of SAEs for LLMs. JumpReLU makes it easier to identify and track individual features in LLM activations, which can be a step toward understanding how LLMs learn and reason.

The challenge of interpreting LLMs

The fundamental building block of a neural network is individual neurons, tiny mathematical functions that process and transform data. During training, neurons are tuned to become active when they encounter specific patterns in the data.

However, individual neurons don’t necessarily correspond to specific concepts. A single neuron might activate for thousands of different concepts, and a single concept might activate a broad range of neurons across the network. This makes it very difficult to understand what each neuron represents and how it contributes to the overall behavior of the model.

This problem is especially pronounced in LLMs, which have billions of parameters and are trained on massive datasets. As a result, the activation patterns of neurons in LLMs are extremely complex and difficult to interpret.

Sparse autoencoders

Autoencoders are neural networks that learn to encode one type of input into an intermediate representation, and then decode it back to its original form. Autoencoders come in different flavors and are used for different applications, including compression, image denoising, and style transfer.

Sparse autoencoders (SAE) use the concept of autoencoder with a slight modification. During the encoding phase, the SAE is forced to only activate a small number of the neurons in the intermediate representation.

This mechanism enables SAEs to compress a large number of activations into a small number of intermediate neurons. During training, the SAE receives activations from layers within the target LLM as input.

SAE tries to encode these dense activations through a layer of sparse features. Then it tries to decode the learned sparse features and reconstruct the original activations. The goal is to minimize the difference between the original activations and the reconstructed activations while using the smallest possible number of intermediate features.

The challenge of SAEs is to find the right balance between sparsity and reconstruction fidelity. If the SAE is too sparse, it won’t be able to capture all the important information in the activations. Conversely, if the SAE is not sparse enough, it will be just as difficult to interpret as the original activations.

JumpReLU SAE

SAEs use an “activation function” to enforce sparsity in their intermediate layer. The original SAE architecture uses the rectified linear unit (ReLU) function, which zeroes out all features whose activation value is below a certain threshold (usually zero). The problem with ReLU is that it might harm sparsity by preserving irrelevant features that have very small values.

“DeepMind’s JumpReLU SAE aims to address the limitations of previous SAE techniques by making a small change to the activation function. Instead of using a global threshold value, JumpReLU can determine separate threshold values for each neuron in the sparse feature vector.”

This dynamic feature selection makes the training of the JumpReLU SAE a bit more complicated but enables it to find a better balance between sparsity and reconstruction fidelity.