Andrew Lampinen

Research in mechanistic interpretability and neuroscience often relies on interpreting internal representations to understand systems, or manipulating representations to improve models. I gave a talk at the UniReps workshop at NeurIPS on a few challenges for this area, summary thread: 1/12

Specifically, our goal (or at least mine) is to understand or improve a system’s computations; thus, these methods depend on the complex relationship between representation and computation. In the talk I highlighted a few complexities of this relationship: 2/12

First, many analyses (regression, PCA) assume that high-variance representation components are the most important. But equally important features may not carry equal variance; e.g. if a model computes easy (linear) and hard (nonlinear) tasks, the easy one dominates the representations! 3/12

I think several factors contribute to this, including learning dynamics favoring the easy task, and the harder task having several solutions that are not equivalent under linear transformation. This example is from a paper with Katherine Hermann a few years back: proceedings.neurips.cc/paper/2 4/12
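A minimal toy sketch of this kind of effect (my own numpy/scikit-learn illustration, not the experiments from the paper): train a small MLP on an easy (linear) and a hard (nonlinear) target from the same inputs, then ask how much of each target a linear readout of the top principal components of the hidden layer recovers.

```python
# Toy sketch (illustration only, not the paper's setup): an MLP trained on one
# easy (linear) and one hard (nonlinear) target, then probed through PCA.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(4000, 8))
y_easy = X[:, 0]                        # linear function of the inputs
y_hard = np.sin(3 * X[:, 1]) * X[:, 2]  # nonlinear interaction
Y = np.stack([y_easy, y_hard], axis=1)

mlp = MLPRegressor(hidden_layer_sizes=(64,), max_iter=2000,
                   random_state=0).fit(X, Y)

# Hidden-layer activations, recomputed from the fitted weights (relu units).
H = np.maximum(0, X @ mlp.coefs_[0] + mlp.intercepts_[0])
H_pcs = PCA(n_components=8).fit_transform(H)

# How well does a linear readout of the top-k PCs recover each target?
for k in (1, 2, 4, 8):
    r2_easy = LinearRegression().fit(H_pcs[:, :k], y_easy).score(H_pcs[:, :k], y_easy)
    r2_hard = LinearRegression().fit(H_pcs[:, :k], y_hard).score(H_pcs[:, :k], y_hard)
    print(f"top-{k} PCs: R^2 easy={r2_easy:.2f}, hard={r2_hard:.2f}")
# The expectation, per the argument above, is that the easy target is captured
# by the first couple of high-variance PCs, while the hard target needs many
# more components, even though both are equally important outputs of the model.
```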

The second challenge is that representation analyses that rely on simplifying a model (e.g. interpreting representations via PCA) may not generalize out of distribution.
@danfriedman0
explored this in a recent paper: sigmoid.social/@lampinen/11155
arxiv.org/abs/2312.03656 5/12

arXiv.org: Interpretability Illusions in the Generalization of Simplified Models
A common method to study deep learning systems is to use simplified model representations--for example, using singular value decomposition to visualize the model's hidden states in a lower dimensional space. This approach assumes that the results of these simplifications are faithful to the original model. Here, we illustrate an important caveat to this assumption: even if the simplified representations can accurately approximate the full model on the training set, they may fail to accurately capture the model's behavior out of distribution. We illustrate this by training Transformer models on controlled datasets with systematic generalization splits, including the Dyck balanced-parenthesis languages and a code completion task. We simplify these models using tools like dimensionality reduction and clustering, and then explicitly test how these simplified proxies match the behavior of the original model. We find consistent generalization gaps: cases in which the simplified proxies are more faithful to the original model on the in-distribution evaluations and less faithful on various tests of systematic generalization. This includes cases where the original model generalizes systematically but the simplified proxies fail, and cases where the simplified proxies generalize better. Together, our results raise questions about the extent to which mechanistic interpretations derived using tools like SVD can reliably predict what a model will do in novel situations.

We argue that these simplifications effectively correspond to replacing the model with a simplified proxy; for example, a model that *only* uses the top-k principal components in a given computation. Will this simplified model be faithful to the original? 6/12

On the training distribution, a simplified model with 4-8 PCs for attention is fairly faithful; however, on out-of-distribution test data it is less so! In particular, the simplified model often generalizes *worse* than the original! 7/12
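A toy version of the proxy construction and the faithfulness check (again just a hedged numpy/scikit-learn sketch with made-up encoder and readout functions, not the Transformer/Dyck setup from the paper):

```python
# Sketch of a rank-k "simplified proxy": project hidden states onto the top-k
# PCs (fit on training data), keep the model's own readout, and compare the
# proxy's predictions to the full model's, in- vs. out-of-distribution.
# The encoder/readout here are random stand-ins, not a real trained model.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(10, 64)), rng.normal(size=(64, 5))
encode = lambda X: np.maximum(0, X @ W1)      # stand-in hidden layer
readout = lambda H: (H @ W2).argmax(axis=-1)  # stand-in output head

def rank_k_proxy(X_fit, k):
    """Run the 'model' through a top-k principal-component bottleneck."""
    pca = PCA(n_components=k).fit(encode(X_fit))
    return lambda X: readout(pca.inverse_transform(pca.transform(encode(X))))

X_train = rng.normal(size=(2000, 10))        # "in-distribution" inputs
X_shift = rng.normal(size=(2000, 10)) + 2.0  # shifted "out-of-distribution" inputs
full = lambda X: readout(encode(X))
for k in (4, 8, 16):
    proxy = rank_k_proxy(X_train, k)
    print(f"k={k}: agreement ID={np.mean(proxy(X_train) == full(X_train)):.2f}, "
          f"OOD={np.mean(proxy(X_shift) == full(X_shift)):.2f}")
# The question from the paper is exactly this gap: a proxy that matches the
# full model well on the training distribution may match it much less well
# once the inputs move off-distribution.
```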

More generally, even without simplification, representational similarity does not necessarily imply computational similarity. E.g. in some cases larger models tend to have more similar representations OOD, but not more similar behaviors! 8/12
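For concreteness, here is one pair of measures that can come apart like this (a hedged sketch of standard metrics, not code from the talk): linear CKA between two models' hidden states as the representational measure, and prediction agreement as the behavioral one.

```python
# Two notions of "similarity" that need not move together: linear CKA over
# hidden states vs. agreement of the models' actual predictions.
import numpy as np

def linear_cka(H1, H2):
    """Linear CKA between two (n_examples, n_features) activation matrices."""
    H1 = H1 - H1.mean(axis=0)
    H2 = H2 - H2.mean(axis=0)
    cross = np.linalg.norm(H1.T @ H2, "fro") ** 2
    return cross / (np.linalg.norm(H1.T @ H1, "fro") * np.linalg.norm(H2.T @ H2, "fro"))

def behavioral_agreement(preds_a, preds_b):
    """Fraction of inputs on which two models make the same prediction."""
    return np.mean(np.asarray(preds_a) == np.asarray(preds_b))

# On OOD inputs, a high linear_cka(H_a, H_b) between two models does not
# guarantee a high behavioral_agreement(preds_a, preds_b) -- that is the
# dissociation being pointed to here.
```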

The final issue I highlighted arises when comparing static model representations to human (or animal) representations, which are contextual and dynamic. 9/12

To make that comparison, we typically average the natural data over contexts or time; this averages away interesting features of the natural representations, and overestimates noise in the natural system (which in turn leads to overestimating model fit). 10/12
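A small illustration of that averaging problem (a toy simulation added here for intuition, not data from the talk): context-dependent "neural" responses, a static representation obtained by averaging over contexts, and the inflated noise estimate you get if context variability is lumped in with trial noise.

```python
# Toy simulation: responses depend on stimulus AND context, plus trial noise.
import numpy as np

rng = np.random.default_rng(0)
n_stim, n_ctx, n_rep, n_units = 20, 5, 10, 50
stim_code = rng.normal(size=(n_stim, 1, 1, n_units))  # static stimulus tuning
ctx_code = rng.normal(size=(1, n_ctx, 1, n_units))    # context-dependent component
noise = 0.5 * rng.normal(size=(n_stim, n_ctx, n_rep, n_units))
responses = stim_code + ctx_code + noise              # (stim, ctx, repeat, unit)

# Averaging over contexts and repeats gives one static vector per stimulus;
# the context-dependent structure is simply gone from this representation.
static = responses.mean(axis=(1, 2))

# If context variation is lumped in with trial noise, the "noise" looks much
# larger than the true repeat-to-repeat noise, so a static model is compared
# to an inflated noise estimate (and hence looks like a better fit than it is).
repeat_noise_var = responses.var(axis=2).mean()
lumped_noise_var = responses.reshape(n_stim, -1, n_units).var(axis=1).mean()
print(f"repeat-only noise variance:     {repeat_noise_var:.2f}")
print(f"repeats + context 'noise' var.: {lumped_noise_var:.2f}")
```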

These challenges don't mean that representational research is hopeless, of course; it's just good to be reminded of them. Indeed, neuroscience has grappled with (and discussed) these issues for many years. My goal in the talk was just to build intuitions for them. 11/12

If you want to learn more, we review a lot of perspectives on representational analysis, including these and other challenges, in our recent survey paper on "Getting aligned on representational alignment"
sigmoid.social/@lampinen/11134
arxiv.org/abs/2310.13018
Thanks for reading! 12/12

arXiv.org: Getting aligned on representational alignment
Biological and artificial information processing systems form representations of the world that they can use to categorize, reason, plan, navigate, and make decisions. How can we measure the similarity between the representations formed by these diverse systems? Do similarities in representations then translate into similar behavior? If so, then how can a system's representations be modified to better match those of another system? These questions pertaining to the study of representational alignment are at the heart of some of the most promising research areas in contemporary cognitive science, neuroscience, and machine learning. In this Perspective, we survey the exciting recent developments in representational alignment research in the fields of cognitive science, neuroscience, and machine learning. Despite their overlapping interests, there is limited knowledge transfer between these fields, so work in one field ends up duplicated in another, and useful innovations are not shared effectively. To improve communication, we propose a unifying framework that can serve as a common language for research on representational alignment, and map several streams of existing work across fields within our framework. We also lay out open problems in representational alignment where progress can benefit all three of these fields. We hope that this paper will catalyze cross-disciplinary collaboration and accelerate progress for all communities studying and developing information processing systems.