Fine-tuning millions of dimensions is not as complex as you may think.
Actually, it is quite interpretable in Euclidean space, by
angles measured from the pretrained model.
Seeds fall in small regions
Tasks in larger ones
All in some direction
https://arxiv.org/abs/2302.04863
#NLP #nlproc #loss #sgd #machinelearning #visualization #model #LLM #LLMS
When you fine-tune (even parameter-efficiently) you change millions or billions of parameters, and who can imagine directions in such a high-dimensional space?
Apparently not SGD...
Jokes aside: we fine-tuned on different tasks and repeated over random seeds. Runs that differ only by seed share a color. See a pattern?
Noticed the 3 miscolored labels?
Well, what I didn't tell you is that t-SNE is just a visualization (don't trust it!). We also clustered the full weight updates by their angle relative to the pretrained model (cosine similarity).
Each point gets an outer color determined by its cluster (hence the "miscolored" labels).
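The intuition behind angle-based clustering can be sketched in a toy setup (this is not the paper's code; checkpoints are hypothetical flat vectors, and the "task directions" are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
pre = rng.normal(size=100)      # "pretrained" point in weight space
task_a = rng.normal(size=100)   # a hypothetical direction per task
task_b = rng.normal(size=100)

def ckpt(task_dir, noise=0.1):
    """A toy fine-tuned checkpoint: pretrained + task direction + seed noise."""
    return pre + task_dir + noise * rng.normal(size=100)

def cos_to_pre(x, y):
    """Cosine similarity of two weight updates, measured from the pretrained point."""
    dx, dy = x - pre, y - pre
    return dx @ dy / (np.linalg.norm(dx) * np.linalg.norm(dy))

a1, a2, b1 = ckpt(task_a), ckpt(task_a), ckpt(task_b)
within = cos_to_pre(a1, a2)   # same task, different seeds -> near 1
across = cos_to_pre(a1, b1)   # different tasks -> near 0 in high dimensions
```

Clustering on these pairwise cosine similarities then groups same-task seeds together, regardless of what t-SNE happens to draw.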
Well, seeds are close to each other? Really great surprise...
What about models trained on similar tasks but different datasets? (Or all fine-tuned models vs. a random direction?)
What about the models between those task regions? Models that SGD did not (could not?) reach?
They are not only good, they are better than the actual fine-tuned models. Not only on the dataset whose seed runs we combined, but also in generalization when combining across datasets (same task)!
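A "model between the fine-tuned models" can be sketched as a convex combination of their weights (a toy setup with hypothetical flat weight vectors; real checkpoints are per-layer tensors, and this is not the paper's code):

```python
import numpy as np

def combine_weights(checkpoints, weights=None):
    """Convex combination of checkpoints (uniform average by default)."""
    ckpts = np.stack(checkpoints)
    if weights is None:
        weights = np.full(len(ckpts), 1.0 / len(ckpts))
    weights = np.asarray(weights, dtype=float)
    assert np.isclose(weights.sum(), 1.0) and (weights >= 0).all()
    return weights @ ckpts  # weighted sum over the checkpoint axis

# midpoint of two toy checkpoints lies inside the region they span
a = np.array([1.0, 0.0, 2.0])
b = np.array([0.0, 1.0, 2.0])
mid = combine_weights([a, b])
```

The surprising part is not the arithmetic but that such a point, which SGD never visited, can outperform the endpoints.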
And no, it is not that everything is good; we know fine-tuning works great. Go outside the region spanned by the actual fine-tuned models and the loss is horrible.
So the models we get sit roughly on the edge of the good region, while the middle generalizes better.
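One way to probe that edge is to evaluate the loss at points along (and beyond) the line from the pretrained model to a fine-tuned one. A minimal sketch, again with hypothetical flat weight vectors:

```python
import numpy as np

def along_line(pretrained, finetuned, alpha):
    """Point at fraction alpha from pretrained toward finetuned.

    alpha in [0, 1] interpolates between the two models;
    alpha > 1 extrapolates past the fine-tuned model, i.e. outside
    the region where loss is expected to blow up.
    """
    return (1.0 - alpha) * pretrained + alpha * finetuned

pre = np.zeros(4)
ft = np.array([1.0, -2.0, 0.5, 3.0])
inside = along_line(pre, ft, 0.5)    # between the two models
outside = along_line(pre, ft, 2.0)   # past the fine-tuned edge
```

Sweeping alpha and plotting the loss at each point is the standard way to see where the good region ends.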
While the results are surprising and surprisingly strong, don't take them at face value; there are surely complexities to be found and deeper explanations.
Happy to hear your thoughts on this. Is it only surprising to me?