Fine-tuning millions of dimensions is not as complex as you may think.
Actually, it is quite interpretable in Euclidean space, by
angles measured from the pretrained model.
Seeds fall in small regions
Tasks in larger ones
All in some direction
https://arxiv.org/abs/2302.04863
#NLP #nlproc #loss #sgd #machinelearning #visualization #model #LLM #LLMS
When you fine-tune (even parameter-efficiently) you change millions or billions of parameters, and who can imagine directions in such a high-dimensional space?
Apparently not SGD...
Jokes aside: we fine-tuned on different tasks and repeated over random seeds. Runs that differ only by seed share a color. See a pattern?
Noticed the 3 miscolored labels?
Well, what I didn't tell you is that t-SNE is just a visualization (don't trust it!). We also clustered the full weight updates by their angle relative to the pretrained model (cosine similarity).
Each point gets an outer color determined by its cluster (hence the "miscolored" labels).
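The intuition behind angle-based clustering can be sketched in a toy setup (this is not the paper's code; checkpoints are hypothetical flat vectors, and the "task directions" are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
pre = rng.normal(size=100)      # "pretrained" point in weight space
task_a = rng.normal(size=100)   # a hypothetical direction per task
task_b = rng.normal(size=100)

def ckpt(task_dir, noise=0.1):
    """A toy fine-tuned checkpoint: pretrained + task direction + seed noise."""
    return pre + task_dir + noise * rng.normal(size=100)

def cos_to_pre(x, y):
    """Cosine similarity of two weight updates, measured from the pretrained point."""
    dx, dy = x - pre, y - pre
    return dx @ dy / (np.linalg.norm(dx) * np.linalg.norm(dy))

a1, a2, b1 = ckpt(task_a), ckpt(task_a), ckpt(task_b)
within = cos_to_pre(a1, a2)   # same task, different seeds -> near 1
across = cos_to_pre(a1, b1)   # different tasks -> near 0 in high dimensions
```

Clustering on these pairwise cosine similarities then groups same-task seeds together, regardless of what t-SNE happens to draw.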
Well, seeds are close to each other? Really great surprise...
What about models trained on similar tasks but different datasets? (Or all fine-tuned models vs. a random direction?)
What about the models between those task regions? Models that SGD did not (could not?) reach?
They are not only good, they are better than the actual fine-tuned models. Not only on the dataset whose seed runs we combined, but also in generalization when combining across datasets (same task)!
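A "model between the fine-tuned models" can be sketched as a convex combination of their weights (a toy setup with hypothetical flat weight vectors; real checkpoints are per-layer tensors, and this is not the paper's code):

```python
import numpy as np

def combine_weights(checkpoints, weights=None):
    """Convex combination of checkpoints (uniform average by default)."""
    ckpts = np.stack(checkpoints)
    if weights is None:
        weights = np.full(len(ckpts), 1.0 / len(ckpts))
    weights = np.asarray(weights, dtype=float)
    assert np.isclose(weights.sum(), 1.0) and (weights >= 0).all()
    return weights @ ckpts  # weighted sum over the checkpoint axis

# midpoint of two toy checkpoints lies inside the region they span
a = np.array([1.0, 0.0, 2.0])
b = np.array([0.0, 1.0, 2.0])
mid = combine_weights([a, b])
```

The surprising part is not the arithmetic but that such a point, which SGD never visited, can outperform the endpoints.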
And no, it is not that everything is good; we know fine-tuning works great. Go outside the region spanned by the actual fine-tuned models and the loss is horrible.
So the models we get sit roughly on the edge of the good region, while the middle generalizes better.
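One way to probe that edge is to evaluate the loss at points along (and beyond) the line from the pretrained model to a fine-tuned one. A minimal sketch, again with hypothetical flat weight vectors:

```python
import numpy as np

def along_line(pretrained, finetuned, alpha):
    """Point at fraction alpha from pretrained toward finetuned.

    alpha in [0, 1] interpolates between the two models;
    alpha > 1 extrapolates past the fine-tuned model, i.e. outside
    the region where loss is expected to blow up.
    """
    return (1.0 - alpha) * pretrained + alpha * finetuned

pre = np.zeros(4)
ft = np.array([1.0, -2.0, 0.5, 3.0])
inside = along_line(pre, ft, 0.5)    # between the two models
outside = along_line(pre, ft, 2.0)   # past the fine-tuned edge
```

Sweeping alpha and plotting the loss at each point is the standard way to see where the good region ends.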
While the results are surprising and surprisingly strong, don't take them at face value; there are surely complexities to be found and deeper explanations.
Happy to hear your thoughts on this. Is it only surprising to me?