I don’t train from scratch, I use RoBERTa🧐
Wait…
Why not cross-encoder/stsb-roberta? Or facebook/muppet-roberta?

We automatically identify the best models on 🤗 (periodically)

Just pick the best one
and finetune on your task

ibm.github.io/model-recycling/
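For example, here is a minimal sketch of what "pick the best one and finetune" might look like with 🤗 transformers. The model id and dataset are placeholders, not the leaderboard's actual recommendation: swap in the current best checkpoint for your architecture and your own task.

```python
# Minimal sketch: finetune a recycled base model instead of the vanilla one.
# "roberta-base" is a placeholder; check ibm.github.io/model-recycling/
# for the current best checkpoint per architecture.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

base = "roberta-base"  # swap in the leaderboard's recommended model id

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(
    base,
    num_labels=2,
    ignore_mismatched_sizes=True,  # recycled checkpoints may ship a head sized for another task
)

# Your target task goes here; IMDB is just a stand-in binary classification task.
ds = load_dataset("imdb")
ds = ds.map(lambda x: tokenizer(x["text"], truncation=True), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3),
    train_dataset=ds["train"],
    eval_dataset=ds["test"],
    tokenizer=tokenizer,  # default collator pads each batch dynamically
)
trainer.train()
```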

Model Recycling: the best model per architecture. Comparing finetuned models from HF as base models for future finetuning.

Finetuned models are known to sometimes be better base models than the vanilla pretrained model.

We have found that
🟪 while 5/6 of finetuned models are not good base models,
🟩 strong models outperform the pretrained model consistently and often

(Fig.: gain of each finetuned model over the pretrained baseline; each row is a model)

Yet in practice, this knowledge is rarely applied. We tested T5, RoBERTa, and BERT finetuned models (ours and @huggingface's) and the best are... THE BEST

Hence, we decided to share it with the world, so people would test it for themselves:

Reposting it from the birdsite, but we are already working on improvements and will share updates soon. Please keep in touch if you have any ideas, questions, or comments.

Paper: arxiv.org/abs/2211.00107

Related work:
sigmoid.social/@LChoshen/10929

arXiv.org: Where to start? Analyzing the potential value of intermediate models

Previous studies observed that finetuned models may be better base models than the vanilla pretrained model. Such a model, finetuned on some source dataset, may provide a better starting point for a new finetuning process on a desired target dataset. Here, we perform a systematic analysis of this intertraining scheme over a wide range of English classification tasks. Surprisingly, our analysis suggests that the potential intertraining gain can be analyzed independently for the target dataset under consideration and for a base model being considered as a starting point. This is in contrast to the current perception that the alignment between the target dataset and the source dataset used to generate the base model is a major factor in determining intertraining success. We analyze different aspects that contribute to each. Furthermore, we leverage our analysis to propose a practical and efficient approach to determine if and how to select a base model in real-world settings. Last, we release an updating ranking of the best models in the HuggingFace hub per architecture: https://ibm.github.io/model-recycling/

We test models from the @huggingface hub and
rank them efficiently (linear probing on one task)
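For intuition, here is a rough sketch of the linear-probing idea: freeze the encoder, fit a linear classifier on its features, and rank models by the probe's held-out accuracy. This illustrates the approach under our own assumptions, not the exact evaluation code, and the helper name is made up.

```python
# Rough sketch of linear probing to rank candidate base models cheaply:
# freeze each encoder, extract its [CLS] features once, fit a linear
# classifier, and rank models by held-out accuracy. Illustrative only.
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from transformers import AutoModel, AutoTokenizer

def probe_score(model_id, train_texts, train_labels, dev_texts, dev_labels):
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    encoder = AutoModel.from_pretrained(model_id).eval()  # frozen: no finetuning

    @torch.no_grad()
    def embed(texts):
        batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        return encoder(**batch).last_hidden_state[:, 0].numpy()  # [CLS] features

    clf = LogisticRegression(max_iter=1000).fit(embed(train_texts), train_labels)
    return accuracy_score(dev_labels, clf.predict(embed(dev_texts)))

# Rank candidates by their probe score on a single held-out task:
# scores = {m: probe_score(m, tr_x, tr_y, dev_x, dev_y) for m in candidates}
```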

The best ones, we finetune on 36 different datasets and share with you here:
ibm.github.io/model-recycling/

Next time you finetune, just pick the best one

Why use a worse model?


Fixed link: ibm.github.io/model-recycling/
