Automated Interpretability Project
Overview
The Automated Interpretability project studies neuron behavior in language models. It provides code and tools to generate, simulate, and evaluate explanations for the behavior of neurons in models such as GPT-2, following the approach described in the paper "Language Models Can Explain Neurons in Language Models."
Code and Tools
The project repository provides the following resources:
- Neuron Behavior Explanation Code: Code for generating, simulating, and scoring explanations of neuron behavior; these explanations help researchers understand why neurons react the way they do. For full usage instructions, see the neuron-explainer README. (A minimal sketch of the explain-simulate-score loop appears after this list.)
- Neuron Activation Viewer Tool: An interactive tool, accessible online, for viewing neuron activations and their accompanying explanations.
If users encounter credential-related errors when accessing these tools, signing up for an Azure account may resolve the issue.
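The core loop behind the explanation code is: generate an explanation of a neuron from (token, activation) pairs, use that explanation to simulate the neuron's activations on other text, and score the explanation by how well the simulated activations match the real ones. The sketch below illustrates only the scoring step, using a simple correlation measure in the spirit of the paper's scoring approach; the helper values are toy data, and the actual entry points and scoring details are documented in the neuron-explainer README.

```python
# Minimal sketch of the scoring step in the explain -> simulate -> score loop.
# The real and simulated activations here are toy values; in practice they come
# from the neuron-explainer tooling described in the repository README.

from typing import Sequence
import numpy as np


def score_explanation(
    real_activations: Sequence[float],
    simulated_activations: Sequence[float],
) -> float:
    """Correlation between a neuron's real and simulated activations.

    A score near 1.0 means the explanation predicts the neuron's behavior well;
    a score near 0.0 means it explains little.
    """
    real = np.asarray(real_activations, dtype=float)
    simulated = np.asarray(simulated_activations, dtype=float)
    if real.std() == 0 or simulated.std() == 0:
        return 0.0  # degenerate case: constant activations carry no signal
    return float(np.corrcoef(real, simulated)[0, 1])


# Toy example: the simulation tracks the real activations closely,
# so the explanation receives a high score.
real = [0.0, 2.1, 0.0, 0.3, 5.0]
simulated = [0.1, 1.8, 0.0, 0.5, 4.2]
print(f"explanation score: {score_explanation(real, simulated):.2f}")
```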
Public Datasets
In addition to tools and code, the project shares extensive datasets that can benefit researchers and developers:
- Neuron Activations: Tokenized text sequences with the corresponding activations for each neuron, organized by layer and neuron index for easy access. (A minimal loading sketch appears below.)
- Neuron Explanations: Scored, model-generated explanations of neuron behavior, complete with simulation results, showing how and why neurons activate in specific scenarios.
- Related Neurons: Lists of neurons with notable positive or negative connections, both upstream and downstream, for exploring interactions between neurons.
- Tokens with High Activations: Lists of tokens with high average activations for specific neurons, offering insight into which words or symbols trigger a neuron.
- Tokens with Significant Weights: Tokens with substantial input and output weights, indicating their influence on neuron behavior.
Additional datasets are available for GPT-2 Small, though these differ in methodology from those for GPT-2 XL. Users should note this when comparing results.
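As a rough illustration of how one of these datasets might be pulled and inspected, the sketch below fetches a single neuron's activation record over HTTPS. The base URL and path layout shown are assumptions for illustration only; the repository README documents the actual blob-storage layout and file formats and should be treated as authoritative.

```python
# Minimal sketch of fetching one neuron's public activation record.
# The URL pattern below is an assumed layout; check the repository README
# for the authoritative paths and file formats.

import json
from urllib.request import urlopen

BASE = "https://openaipublic.blob.core.windows.net/neuron-explainer/data"
layer, neuron = 9, 4013  # arbitrary example indices

url = f"{BASE}/collated-activations/{layer}/{neuron}.json"  # assumed path layout
with urlopen(url) as response:
    record = json.load(response)

# Inspect the structure rather than assuming specific field names.
summary = sorted(record) if isinstance(record, dict) else f"list of {len(record)} items"
print(f"layer {layer}, neuron {neuron}: {summary}")
```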
Recent Updates
There are a few updates to keep in mind:
- A discrepancy was identified in the inference process for GPT-2 models, caused by an optimized implementation of the GELU activation function. This may have slightly affected activation values for GPT-2 small. (A comparison of GELU variants follows this list.)
- Definitions and details of model weights and token connections are available for those who want to dig deeper, including the mathematical formulations used to relate neuron weights and token interactions. (An illustrative formulation is sketched below.)
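For context on the GELU point above, here is a small comparison of the exact (erf-based) GELU and the widely used tanh approximation. This is only an illustration of how two GELU variants can differ numerically; which variant was used where in the inference code is documented in the repository.

```python
# Exact GELU versus the common tanh approximation. Small numerical differences
# between such variants can slightly shift recorded activation values.

import math


def gelu_exact(x: float) -> float:
    """GELU(x) = x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    """Tanh-based approximation of GELU."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)))


for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  exact={gelu_exact(x):+.6f}  tanh={gelu_tanh(x):+.6f}")
```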
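For the weight and token-connection definitions mentioned in the last item, the general shape of such formulations is sketched below. This is an illustrative formulation only, not the project's verbatim definition; the exact conventions and normalization are given in the repository's documentation.

```latex
% Illustrative only: one common way to express neuron-neuron and neuron-token
% connections in terms of MLP weights (the project's exact definitions may differ).
% Let $w^{\mathrm{in}}_{\ell,n}$ be the input-weight vector of neuron $n$ in layer
% $\ell$ (a column of the MLP input matrix), $w^{\mathrm{out}}_{\ell,n}$ its
% output-weight vector (a row of the MLP output matrix), and $e_t$ the embedding
% of token $t$.

% Connection strength from an upstream neuron to a downstream neuron:
c\bigl((\ell_1, n_1) \to (\ell_2, n_2)\bigr)
  = w^{\mathrm{out}}_{\ell_1, n_1} \cdot w^{\mathrm{in}}_{\ell_2, n_2},
  \qquad \ell_1 < \ell_2 .

% Input weight of a token on a neuron (output weights can be defined analogously
% via the unembedding matrix):
a(t, \ell, n) = e_t \cdot w^{\mathrm{in}}_{\ell, n} .
```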
Lists of Interesting Neurons
For users seeking to explore further, the project provides curated lists of interesting neurons based on various criteria. These lists include neurons deemed notable for specific behaviors or characteristics, such as sensitivity to truncation or their apparent semantic color.
For more detailed explorations, users can access external documents and spreadsheets that categorize neurons based on different metrics and preliminary descriptions.
Overall, the Automated Interpretability project opens up new avenues for understanding the mechanics behind language models, providing invaluable tools and datasets for academic and practical investigations.