Research in mechanistic interpretability and neuroscience often relies on interpreting internal representations to understand systems, or manipulating representations to improve models. I gave a talk at the UniReps workshop at NeurIPS on a few challenges for this area, summary thread: 1/12
#ai #ml #neuroscience #computationalneuroscience #interpretability #NeuralRepresentations #neurips2023
Specifically, our goal (or at least mine) is to understand or improve a system’s computations; thus, these methods depend on the complex relationship between representation and computation. In the talk I highlighted a few complexities of this relationship: 2/12
First, many analyses (regression, PCA) implicitly assume that the highest-variance components of a representation are the most important. But equally important features need not carry equal variance; e.g. if a model computes an easy (linear) task and a hard (nonlinear) task, the easy one dominates the representations! 3/12
I think several factors contribute to this, including learning dynamics that favor the easy task, and the fact that the harder task has several solutions that are not equivalent under linear transformation. This example comes from a paper with Katherine Hermann a few years back: https://proceedings.neurips.cc/paper/2020/hash/71e9c6620d381d60196ebe694840aaaa-Abstract.html 4/12
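To make the kind of analysis concrete, here's a minimal sketch (a toy setup of my own, not the experiments from the paper): train a small MLP on an easy linear target plus a hard XOR-like target, then ask how well each target can be linearly decoded from only the top-k principal components of the hidden layer. The interesting question is how many (typically lower-variance) components the hard target needs compared to the easy one.

```python
# Toy probe of "variance != importance" (my own sketch, not the paper's setup):
# train a small MLP on an easy linear target and a hard XOR-like target, then
# check how well each target can be linearly decoded from the top-k PCs of the
# hidden layer.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(5000, 4))
y_easy = X[:, 0]                                   # linear target
y_hard = np.sign(X[:, 1]) * np.sign(X[:, 2])       # XOR-like nonlinear target
Y = np.stack([y_easy, y_hard], axis=1)

mlp = MLPRegressor(hidden_layer_sizes=(64,), activation="relu",
                   max_iter=2000, random_state=0).fit(X, Y)

# Recover hidden-layer activations from the fitted weights.
H = np.maximum(0, X @ mlp.coefs_[0] + mlp.intercepts_[0])

for k in (1, 2, 4, 8, 64):
    Hk = PCA(n_components=k).fit_transform(H)
    r2_easy = LinearRegression().fit(Hk, y_easy).score(Hk, y_easy)
    r2_hard = LinearRegression().fit(Hk, y_hard).score(Hk, y_hard)
    print(f"top-{k:2d} PCs: easy-task R^2={r2_easy:.2f}, hard-task R^2={r2_hard:.2f}")
```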
The second challenge is that representation analyses that rely on simplifying a model (e.g. interpreting a PCA of its activations) may not generalize out of distribution. @danfriedman0 explored this in a recent paper:
https://sigmoid.social/@lampinen/111557051113763263
https://arxiv.org/abs/2312.03656 5/12
We argue that these simplifications effectively correspond to replacing the model with a simplified proxy; for example, a model that *only* uses the top-k principal components in a given computation. Will this simplified model be faithful to the original? 6/12
On the training distribution, a simplified model using only 4-8 PCs for its attention computations is fairly faithful; however, on out-of-distribution test data it is less so! In particular, the simplified model often generalizes *worse* than the original! 7/12
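Here's a rough sketch of the proxy recipe in a generic toy model (my own construction, not the transformer setup from the paper): fit PCA on the hidden activations over the training distribution, build a proxy that routes activations through only the top-k PCs, and compare the proxy to the original both in- and out-of-distribution.

```python
# A minimal sketch (toy model, not the paper's setup) of the "simplified proxy"
# idea: fit PCA on hidden activations in-distribution, keep only the top-k PCs
# in the forward pass, and measure how far the proxy drifts from the original
# model, in-distribution vs. out-of-distribution.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_train = rng.uniform(-1, 1, size=(5000, 6))
X_ood = rng.uniform(-2, 2, size=(5000, 6))          # wider inputs than training
y = np.sin(3 * X_train[:, 0]) + X_train[:, 1] * X_train[:, 2]

mlp = MLPRegressor(hidden_layer_sizes=(128,), activation="relu",
                   max_iter=2000, random_state=0).fit(X_train, y)

def hidden(X):
    return np.maximum(0, X @ mlp.coefs_[0] + mlp.intercepts_[0])

def readout(H):
    return (H @ mlp.coefs_[1] + mlp.intercepts_[1]).ravel()

# Fit the simplification on the training distribution only.
pca = PCA(n_components=8).fit(hidden(X_train))

def proxy(X):
    H = hidden(X)
    H_k = pca.inverse_transform(pca.transform(H))    # keep only top-k PCs
    return readout(H_k)

for name, X in [("in-distribution", X_train), ("OOD", X_ood)]:
    gap = np.mean((proxy(X) - mlp.predict(X)) ** 2)
    print(f"{name}: mean squared proxy-vs-original gap = {gap:.4f}")
```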
More generally, even without simplification, representational similarity does not necessarily imply computational similarity. E.g. in some cases larger models tend to have more similar representations OOD, but not more similar behaviors! 8/12
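One way to see that these are separate measurements (a toy illustration of my own, not the experiments from the talk): compute a representational similarity score (here linear CKA, as one common choice) and a behavioral agreement score for the same pair of models on the same OOD inputs. The point is simply that the two numbers are computed from different things and can come apart.

```python
# Sketch: representational similarity (linear CKA) vs. behavioral agreement
# for two models trained on the same task, evaluated on shifted (OOD) inputs.
import numpy as np
from sklearn.neural_network import MLPClassifier

def linear_cka(A, B):
    """Linear CKA between two (n_samples, n_features) activation matrices."""
    A = A - A.mean(axis=0)
    B = B - B.mean(axis=0)
    hsic = np.linalg.norm(A.T @ B, "fro") ** 2
    return hsic / (np.linalg.norm(A.T @ A, "fro") * np.linalg.norm(B.T @ B, "fro"))

rng = np.random.default_rng(0)
X_train = rng.normal(size=(4000, 10))
y = (X_train[:, 0] + X_train[:, 1] ** 2 > 1).astype(int)
X_ood = rng.normal(scale=2.0, size=(4000, 10))       # shifted evaluation inputs

models = [MLPClassifier(hidden_layer_sizes=(64,), max_iter=2000,
                        random_state=seed).fit(X_train, y) for seed in (0, 1)]

def hidden(m, X):
    return np.maximum(0, X @ m.coefs_[0] + m.intercepts_[0])

rep_sim = linear_cka(hidden(models[0], X_ood), hidden(models[1], X_ood))
beh_sim = np.mean(models[0].predict(X_ood) == models[1].predict(X_ood))
print(f"OOD representational similarity (CKA): {rep_sim:.2f}")
print(f"OOD behavioral agreement:              {beh_sim:.2f}")
```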
The final issue I highlighted arises when comparing static model representations to human (or animal) representations, which are contextual and dynamic. 9/12
To make that comparison, we typically average the natural responses over trials and contexts. This means we average away interesting features of the natural representations, and overestimate the noise in the natural system (which in turn leads to overestimating model fit). 10/12
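A toy simulation of this point (purely illustrative, my own construction): if responses depend on stimulus and context, but we average over "repeats" that actually differ in context, the context-driven structure is discarded and gets counted as trial-to-trial noise.

```python
# Toy simulation: context-dependent responses, averaged as if context were
# mere repetition noise. The context structure is lost, and the apparent
# "noise" variance is far larger than the noise that was actually injected.
import numpy as np

rng = np.random.default_rng(0)
n_stim, n_context, n_units = 50, 20, 30
stim_tuning = rng.normal(size=(n_stim, n_units))
ctx_tuning = rng.normal(size=(n_context, n_units))
noise_sd = 0.1

# responses[s, c] = stimulus component + context component + genuine noise
responses = (stim_tuning[:, None, :] + ctx_tuning[None, :, :]
             + noise_sd * rng.normal(size=(n_stim, n_context, n_units)))

trial_avg = responses.mean(axis=1)                 # context structure averaged away
apparent_noise_var = responses.var(axis=1).mean()  # context + noise lumped together
true_noise_var = noise_sd ** 2

print(f"trial-averaged matrix used for model comparison: {trial_avg.shape}")
print(f"apparent 'noise' variance across repeats: {apparent_noise_var:.3f}")
print(f"actual injected noise variance:           {true_noise_var:.3f}")
```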
These challenges don't mean that representational research is hopeless, of course; it's just good to be reminded of them. Indeed, neuroscience has grappled with (and discussed) these issues for many years. My goal in the talk was simply to build intuitions for them. 11/12
If you want to learn more, we review a lot of perspectives on representational analysis, including these and other challenges, in our recent survey paper, "Getting aligned on representational alignment":
https://sigmoid.social/@lampinen/111347414419226116
https://arxiv.org/abs/2310.13018
Thanks for reading! 12/12