Which B/16 reigns supreme? I've recently fine-tuned quite a few new ViT models and wanted to compare them. With new multi-weight support on the way, I realized timm will soon have ~20 different B/16 models (or close to it). B/16 is the most common ViT model and the easiest to compare across a wide range of pretrain datasets and methods. In the lead is BEiT v2, but hot on its heels are fine-tuned LAION-2B and OpenAI CLIP image towers. Check out a notebook at https://colab.research.google.com/drive/12u1csH7_Uun78lGti35zvi5-S6FX4ZKu?usp=sharing #CV #machinelearning #vit #AI
This is also my first time experimenting with ImageNet-X (https://facebookresearch.github.io/imagenetx/site/home). There's a lot to unpack here, but I hope to use it to explore more timm models soon... #imagenet
And finally, many of these models are already in timm, but the CLIP image tower weights are currently being added. Check out the progress in https://github.com/rwightman/pytorch-image-models/pull/1520 and watch the #huggingface hub at https://huggingface.co/timm