Andrew Lampinen

Research in mechanistic interpretability and neuroscience often relies on interpreting internal representations to understand systems, or manipulating representations to improve models. I gave a talk at the UniReps workshop at NeurIPS on a few challenges for this area, summary thread: 1/12

Specifically, our goal (or at least mine) is to understand or improve a system’s computations; thus, these methods depend on the complex relationship between representation and computation. In the talk I highlighted a few complexities of this relationship: 2/12

First, many analyses (regression, PCA) assume that high-variance representation components are the most important. But equally important features may not carry equal variance; e.g. if a model computes easy (linear) and hard (nonlinear) tasks, the easy one dominates the representations! 3/12

I think several factors contribute to this, including learning dynamics favoring the easy task, and the harder task having several solutions that are not equivalent under linear transformation. This example is from a paper with Katherine Hermann a few years back: proceedings.neurips.cc/paper/2 4/12
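A minimal toy sketch of this kind of effect (my own numpy/scikit-learn illustration, not the experiments from the paper): train a small MLP on an easy (linear) and a hard (nonlinear) target from the same inputs, then ask how much of each target a linear readout of the top principal components of the hidden layer recovers.

```python
# Toy sketch (illustration only, not the paper's setup): an MLP trained on one
# easy (linear) and one hard (nonlinear) target, then probed through PCA.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(4000, 8))
y_easy = X[:, 0]                        # linear function of the inputs
y_hard = np.sin(3 * X[:, 1]) * X[:, 2]  # nonlinear interaction
Y = np.stack([y_easy, y_hard], axis=1)

mlp = MLPRegressor(hidden_layer_sizes=(64,), max_iter=2000,
                   random_state=0).fit(X, Y)

# Hidden-layer activations, recomputed from the fitted weights (relu units).
H = np.maximum(0, X @ mlp.coefs_[0] + mlp.intercepts_[0])
H_pcs = PCA(n_components=8).fit_transform(H)

# How well does a linear readout of the top-k PCs recover each target?
for k in (1, 2, 4, 8):
    r2_easy = LinearRegression().fit(H_pcs[:, :k], y_easy).score(H_pcs[:, :k], y_easy)
    r2_hard = LinearRegression().fit(H_pcs[:, :k], y_hard).score(H_pcs[:, :k], y_hard)
    print(f"top-{k} PCs: R^2 easy={r2_easy:.2f}, hard={r2_hard:.2f}")
# The expectation, per the argument above, is that the easy target is captured
# by the first couple of high-variance PCs, while the hard target needs many
# more components, even though both are equally important outputs of the model.
```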

The second challenge is that representation analyses that rely on simplifying a model (e.g. interpreting representations via PCA) may not generalize out of distribution.
@danfriedman0
explored this in a recent paper: sigmoid.social/@lampinen/11155
arxiv.org/abs/2312.03656 5/12

arXiv.org: Interpretability Illusions in the Generalization of Simplified Models
A common method to study deep learning systems is to use simplified model representations--for example, using singular value decomposition to visualize the model's hidden states in a lower dimensional space. This approach assumes that the results of these simplifications are faithful to the original model. Here, we illustrate an important caveat to this assumption: even if the simplified representations can accurately approximate the full model on the training set, they may fail to accurately capture the model's behavior out of distribution. We illustrate this by training Transformer models on controlled datasets with systematic generalization splits, including the Dyck balanced-parenthesis languages and a code completion task. We simplify these models using tools like dimensionality reduction and clustering, and then explicitly test how these simplified proxies match the behavior of the original model. We find consistent generalization gaps: cases in which the simplified proxies are more faithful to the original model on the in-distribution evaluations and less faithful on various tests of systematic generalization. This includes cases where the original model generalizes systematically but the simplified proxies fail, and cases where the simplified proxies generalize better. Together, our results raise questions about the extent to which mechanistic interpretations derived using tools like SVD can reliably predict what a model will do in novel situations.

We argue that these simplifications effectively correspond to replacing the model with a simplified proxy; for example, a model that *only* uses the top-k principal components in a given computation. Will this simplified model be faithful to the original? 6/12

On the training distribution, a simplified model with 4-8 PCs for attention is fairly faithful; however, on out-of-distribution test data it is less so! In particular, the simplified model often generalizes *worse* than the original! 7/12
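A toy version of the proxy construction and the faithfulness check (again just a hedged numpy/scikit-learn sketch with made-up encoder and readout functions, not the Transformer/Dyck setup from the paper):

```python
# Sketch of a rank-k "simplified proxy": project hidden states onto the top-k
# PCs (fit on training data), keep the model's own readout, and compare the
# proxy's predictions to the full model's, in- vs. out-of-distribution.
# The encoder/readout here are random stand-ins, not a real trained model.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(10, 64)), rng.normal(size=(64, 5))
encode = lambda X: np.maximum(0, X @ W1)      # stand-in hidden layer
readout = lambda H: (H @ W2).argmax(axis=-1)  # stand-in output head

def rank_k_proxy(X_fit, k):
    """Run the 'model' through a top-k principal-component bottleneck."""
    pca = PCA(n_components=k).fit(encode(X_fit))
    return lambda X: readout(pca.inverse_transform(pca.transform(encode(X))))

X_train = rng.normal(size=(2000, 10))        # "in-distribution" inputs
X_shift = rng.normal(size=(2000, 10)) + 2.0  # shifted "out-of-distribution" inputs
full = lambda X: readout(encode(X))
for k in (4, 8, 16):
    proxy = rank_k_proxy(X_train, k)
    print(f"k={k}: agreement ID={np.mean(proxy(X_train) == full(X_train)):.2f}, "
          f"OOD={np.mean(proxy(X_shift) == full(X_shift)):.2f}")
# The question from the paper is exactly this gap: a proxy that matches the
# full model well on the training distribution may match it much less well
# once the inputs move off-distribution.
```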

More generally, even without simplification, representational similarity does not necessarily imply computational similarity. E.g. in some cases larger models tend to have more similar representations OOD, but not more similar behaviors! 8/12
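For concreteness, here is one pair of measures that can come apart like this (a hedged sketch of standard metrics, not code from the talk): linear CKA between two models' hidden states as the representational measure, and prediction agreement as the behavioral one.

```python
# Two notions of "similarity" that need not move together: linear CKA over
# hidden states vs. agreement of the models' actual predictions.
import numpy as np

def linear_cka(H1, H2):
    """Linear CKA between two (n_examples, n_features) activation matrices."""
    H1 = H1 - H1.mean(axis=0)
    H2 = H2 - H2.mean(axis=0)
    cross = np.linalg.norm(H1.T @ H2, "fro") ** 2
    return cross / (np.linalg.norm(H1.T @ H1, "fro") * np.linalg.norm(H2.T @ H2, "fro"))

def behavioral_agreement(preds_a, preds_b):
    """Fraction of inputs on which two models make the same prediction."""
    return np.mean(np.asarray(preds_a) == np.asarray(preds_b))

# On OOD inputs, a high linear_cka(H_a, H_b) between two models does not
# guarantee a high behavioral_agreement(preds_a, preds_b) -- that is the
# dissociation being pointed to here.
```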

The final issue I highlighted arises when comparing static model representations to human (or animal) representations, which are contextual and dynamic. 9/12

To make that comparison, we typically average the natural data over contexts or time; this averages away interesting features of the natural representations, and overestimates noise in the natural system (which in turn leads to overestimating model fit). 10/12
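A small illustration of that averaging problem (a toy simulation added here for intuition, not data from the talk): context-dependent "neural" responses, a static representation obtained by averaging over contexts, and the inflated noise estimate you get if context variability is lumped in with trial noise.

```python
# Toy simulation: responses depend on stimulus AND context, plus trial noise.
import numpy as np

rng = np.random.default_rng(0)
n_stim, n_ctx, n_rep, n_units = 20, 5, 10, 50
stim_code = rng.normal(size=(n_stim, 1, 1, n_units))  # static stimulus tuning
ctx_code = rng.normal(size=(1, n_ctx, 1, n_units))    # context-dependent component
noise = 0.5 * rng.normal(size=(n_stim, n_ctx, n_rep, n_units))
responses = stim_code + ctx_code + noise              # (stim, ctx, repeat, unit)

# Averaging over contexts and repeats gives one static vector per stimulus;
# the context-dependent structure is simply gone from this representation.
static = responses.mean(axis=(1, 2))

# If context variation is lumped in with trial noise, the "noise" looks much
# larger than the true repeat-to-repeat noise, so a static model is compared
# to an inflated noise estimate (and hence looks like a better fit than it is).
repeat_noise_var = responses.var(axis=2).mean()
lumped_noise_var = responses.reshape(n_stim, -1, n_units).var(axis=1).mean()
print(f"repeat-only noise variance:     {repeat_noise_var:.2f}")
print(f"repeats + context 'noise' var.: {lumped_noise_var:.2f}")
```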

These challenges don't mean that representational research is hopeless, of course; it's just good to be reminded of them. Indeed, neuroscience has grappled with (and discussed) these issues for many years. My goal in the talk was just to build intuitions for them. 11/12

If you want to learn more, we review a lot of perspectives on representational analysis, including these and other challenges, in our recent survey paper on "Getting aligned on representational alignment"
sigmoid.social/@lampinen/11134
arxiv.org/abs/2310.13018
Thanks for reading! 12/12

arXiv.org: Getting aligned on representational alignment
Biological and artificial information processing systems form representations of the world that they can use to categorize, reason, plan, navigate, and make decisions. How can we measure the similarity between the representations formed by these diverse systems? Do similarities in representations then translate into similar behavior? If so, then how can a system's representations be modified to better match those of another system? These questions pertaining to the study of representational alignment are at the heart of some of the most promising research areas in contemporary cognitive science, neuroscience, and machine learning. In this Perspective, we survey the exciting recent developments in representational alignment research in the fields of cognitive science, neuroscience, and machine learning. Despite their overlapping interests, there is limited knowledge transfer between these fields, so work in one field ends up duplicated in another, and useful innovations are not shared effectively. To improve communication, we propose a unifying framework that can serve as a common language for research on representational alignment, and map several streams of existing work across fields within our framework. We also lay out open problems in representational alignment where progress can benefit all three of these fields. We hope that this paper will catalyze cross-disciplinary collaboration and accelerate progress for all communities studying and developing information processing systems.