Key Facts
- ✓ The project is named Symbolic Circuit Distillation.
- ✓ It targets neuron-level circuits like those in OpenAI's 'Sparse Circuits' work.
- ✓ The pipeline uses SMT-based bounded equivalence checking to prove that synthesized programs match the circuit.
- ✓ Current tasks include quote closing and bracket-depth detection.
- ✓ The guarantees are bounded to finite token domains.
Quick Summary
A new interpretability project named Symbolic Circuit Distillation aims to automate the conversion of neuron-level circuits into concise Python programs. The method uses a pipeline that starts with a pruned circuit graph extracted from a transformer for specific behaviors such as quote closing. It then trains a ReLU surrogate network to match the circuit on a finite domain and searches a constrained DSL to synthesize candidate programs. Finally, SMT-based bounded equivalence checking verifies that a candidate program matches the surrogate, and hence the original circuit, on that domain. This approach seeks to provide machine-checkable guarantees for circuit behavior, moving beyond manual analysis.
The Distillation Pipeline
The Symbolic Circuit Distillation project introduces a four-step pipeline to automate the interpretation of neural circuits. The process begins with a pruned circuit graph for a specific behavior, such as quote closing or bracket depth, extracted from a transformer model. This circuit is treated as an executable function.
Next, a tiny ReLU network is trained to act as a 'surrogate.' This surrogate is designed to exactly match the original circuit's behavior on all inputs within a bounded domain, typically sequences of length 5 to 10 over a small token alphabet. The system then searches over a constrained Domain-Specific Language (DSL) of common transformer motifs to synthesize candidate Python programs. These motifs include counters, toggles, threshold detectors, and small state machines.
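The post does not include the DSL itself, but the motifs it names suggest candidate programs of roughly the following shape. This is a hedged sketch: the function names, signatures, and token choices are illustrative assumptions, not the project's actual DSL output.

```python
# Illustrative sketches of the motifs named above (toggle, counter,
# threshold detector). These are assumptions about what synthesized
# candidates might look like, not the project's actual DSL.

def quote_is_open(tokens):
    """Toggle motif: does the sequence end with an unclosed quote?"""
    open_quote = False
    for t in tokens:
        if t == '"':
            open_quote = not open_quote
    return open_quote

def bracket_depth(tokens):
    """Counter motif: net bracket-nesting depth of the sequence."""
    depth = 0
    for t in tokens:
        if t == '(':
            depth += 1
        elif t == ')':
            depth -= 1
    return depth

def depth_exceeds(tokens, limit=2):
    """Threshold-detector motif: does nesting depth ever exceed a limit?"""
    depth = 0
    for t in tokens:
        depth += (t == '(') - (t == ')')
        if depth > limit:
            return True
    return False
```

Programs of this size are easy for both humans and solvers to reason about, which is presumably why the search is restricted to such motifs.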
The final step uses SMT-based bounded equivalence checking. The solver either proves that a candidate program and the surrogate agree on all inputs in the domain, or it produces a counterexample input that rules the candidate out. If the solver finds a proof, the result is a small, human-readable Python function accompanied by a machine-checkable guarantee that it matches the original circuit on that bounded domain.
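The post does not specify the SMT encoding or solver. As a minimal sketch of the idea, using the z3-solver Python bindings (an assumed tool choice), one can unroll a candidate and a reference symbolically over a bounded token domain and ask the solver for a disagreeing input:

```python
# Bounded equivalence sketch with Z3 (z3-solver). The encoding, length
# bound, and token domain here are illustrative assumptions.
from z3 import And, If, Int, Solver

N = 5  # bounded sequence length
toks = [Int(f"t{i}") for i in range(N)]
in_domain = And(*[And(t >= 0, t <= 1) for t in toks])  # 1 = quote token

def toggle(tokens):
    """Candidate program: flip a bit on every quote, symbolically unrolled."""
    state = 0
    for t in tokens:
        state = If(t == 1, 1 - state, state)
    return state

def quote_parity(tokens):
    """Reference (stand-in for the surrogate): quote count mod 2."""
    count = 0
    for t in tokens:
        count = count + If(t == 1, 1, 0)
    return count % 2

s = Solver()
# Ask the solver for an in-domain input on which the two disagree.
s.add(in_domain, toggle(toks) != quote_parity(toks))
result = s.check()
# unsat => no counterexample exists: the candidate provably agrees with
# the reference on the whole bounded domain; sat would come with a
# concrete counterexample input.
print(result)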
"Mechanistic interpretability has gotten pretty good at extracting 'small crisp circuits' from large models, but turning those graphs into clean, human-readable algorithms is still very manual."
— Project Creator
Motivation and Objectives
The project was built to address a specific bottleneck in mechanistic interpretability. While this field has become proficient at extracting 'small crisp circuits' from large models, the process of turning those graph representations into clean, human-readable algorithms remains largely manual. The primary goal of Symbolic Circuit Distillation is to automate this final step.
By removing the need for manual hand-holding, the project aims to go directly from 'here is a sparse circuit' to 'here is a verified algorithm that explains what it does.' This automation is critical for scaling interpretability efforts to larger models and more complex behaviors. The reliance on formal methods ensures that the resulting algorithms are not just plausible guesses but verified implementations of the circuit's logic on the checked domain.
Current Capabilities and Limitations
As of the latest update, the system works on specific tasks: it handles quote closing and bracket-depth detection tasks derived from OpenAI's circuit_sparsity repository. The pipeline achieves exact surrogate fitting on finite token domains and uses DSL templates for simple counters, toggles, and small state machines. SMT-based bounded equivalence is established between the sparse circuit, the ReLU surrogate, and the synthesized Python program.
However, significant limitations remain. The guarantees provided are strictly bounded; equivalence is only proven on finite token domains consisting of short sequences and a small vocabulary. Currently, the project is focused on very small circuits. Scaling to larger circuits and longer contexts represents open engineering and research work. Additionally, the DSL is hand-designed around a few specific motifs. The creator has noted that they are not yet learning the DSL itself or employing advanced search strategies.
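To make the boundedness concrete, a quick back-of-the-envelope count shows why domains of length 5 to 10 over a small alphabet are exhaustively checkable while longer contexts are not. The alphabet sizes below are illustrative assumptions; the post only says "small."

```python
# Number of token sequences of length 0..max_len over a given alphabet.
# Alphabet sizes are illustrative assumptions, not figures from the post.
def domain_size(alphabet_size, max_len):
    return sum(alphabet_size ** n for n in range(max_len + 1))

print(domain_size(5, 10))   # 12,207,031 sequences: millions, tractable
print(domain_size(50, 40))  # astronomically larger: why scaling is open work
```

The exponential growth in both alphabet size and sequence length is exactly the scaling obstacle the limitations above describe.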
Future Directions and Feedback
The creator is actively seeking feedback on several aspects of the project. In particular, they ask whether the problem framing and the bounded guarantees are interesting to people working in mechanistic interpretability or formal methods, and which circuits or behaviors the community would like to see distilled as the next benchmarks.
Feedback is also sought regarding the DSL design, search strategy, and SMT setup. The project invites questions about implementation details, the SMT encoding, and integration with existing repositories. This open approach aims to refine the tool based on community needs and expand its applicability to a wider range of neural network behaviors.
"My goal here is to automate that last step: go from 'here is a sparse circuit' to 'here is a verified algorithm that explains what it does', without hand-holding."
— Project Creator