Neil Band

Announcing the public release of the (😅) RETINA Benchmark:

A suite of tasks evaluating the reliability of uncertainty quantification methods like Deep Ensembles, MC Dropout, Parameter- and Function-Space VI, and more.

Paper: arxiv.org/abs/2211.12717
Code+Checkpoints: rebrand.ly/retina-benchmark


🧵 below 👇🏾 [0/N]
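As a rough illustration of the kind of uncertainty estimate the methods listed above produce, here is a minimal MC Dropout sketch in PyTorch; the model, pass count, and function name are placeholders, not the benchmark's implementation.

```python
import torch

def mc_dropout_predict(model, x, num_samples=20):
    """Average softmax outputs over stochastic forward passes with
    dropout kept active (MC Dropout)."""
    was_training = model.training
    model.train()  # keep dropout stochastic at test time (note: also affects batch norm)
    with torch.no_grad():
        probs = torch.stack([
            torch.softmax(model(x), dim=-1) for _ in range(num_samples)
        ])  # [num_samples, batch, num_classes]
    model.train(was_training)
    mean_probs = probs.mean(dim=0)  # approximate predictive distribution
    predictive_entropy = -(mean_probs * mean_probs.clamp_min(1e-12).log()).sum(dim=-1)
    return mean_probs, predictive_entropy
```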

[1/N] Bayesian Deep Learning has promised to improve neural network reliability on safety-critical applications, such as those in healthcare and autonomous driving.

Yet to holistically assess Bayesian Deep Learning methods, we need benchmarks on real-world tasks that reflect realistic distribution shifts, and strong uncertainty quantification baselines that capture both aleatoric and epistemic uncertainty.
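For intuition on that split: with an ensemble (or posterior samples), total predictive entropy decomposes into an aleatoric term (the expected per-member entropy) plus an epistemic term (the mutual information between the prediction and the model). A minimal NumPy sketch, not the paper's code:

```python
import numpy as np

def entropy(p, axis=-1, eps=1e-12):
    return -(p * np.log(p + eps)).sum(axis=axis)

def uncertainty_decomposition(member_probs):
    """member_probs: [num_members, batch, num_classes] softmax outputs.

    Total predictive entropy = aleatoric (expected member entropy)
                             + epistemic (mutual information)."""
    mean_probs = member_probs.mean(axis=0)
    total = entropy(mean_probs)                     # H[ E_theta p(y|x, theta) ]
    aleatoric = entropy(member_probs).mean(axis=0)  # E_theta H[ p(y|x, theta) ]
    epistemic = total - aleatoric                   # mutual information
    return total, aleatoric, epistemic
```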

[2/N] To this end, we designed RETINA, a suite of real-world tasks assessing the reliability of several established and SoTA Bayesian and non-Bayesian uncertainty quantification methods.

[3/N] We curated two public datasets of high-res human retina images exhibiting varying degrees of diabetic retinopathy, and evaluated methods on an automated diagnosis task (pictured) that requires reliable predictive uncertainty quantification.

[4/N] Our main takeaway: evaluation that ignores uncertainty can be misleading. E.g., on the “Country Shift” task, models are trained on the US-sourced EyePACS dataset and evaluated out-of-domain on the Indian APTOS dataset.

Counterintuitively, when considering ROC curves, methods consistently perform better on the distributionally shifted APTOS data than in-domain (black dot is the NHS-recommended threshold for automated diagnosis).
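A comparison of this kind can be sketched with scikit-learn; the labels and scores below are random placeholders standing in for model outputs on the two evaluation sets.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)
# Placeholders for binary "referable DR" labels and predicted probabilities
# on the in-domain (EyePACS) and shifted (APTOS) evaluation sets.
y_eyepacs, p_eyepacs = rng.integers(0, 2, 1000), rng.random(1000)
y_aptos,   p_aptos   = rng.integers(0, 2, 500),  rng.random(500)

print("in-domain AUROC:", roc_auc_score(y_eyepacs, p_eyepacs))
print("shifted AUROC:  ", roc_auc_score(y_aptos, p_aptos))

fpr, tpr, thresholds = roc_curve(y_aptos, p_aptos)
# Plotting (fpr, tpr) for both sets gives the two ROC curves being compared;
# the NHS-recommended operating point corresponds to one threshold on the curve.
```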

[5/N] We use “selective prediction” to simulate automated diagnosis pipelines, computed as pictured. If a model has good uncertainty estimates, its performance p should increase with the referral rate 𝛕, i.e., the proportion of patients referred to a medical expert.
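A minimal NumPy sketch of such a referral curve (my own naming, not the benchmark's code): sort cases by predictive uncertainty, refer the most uncertain fraction 𝛕 to the expert, and score the model on the rest.

```python
import numpy as np

def selective_prediction_curve(uncertainty, correct, taus=np.linspace(0.0, 0.9, 10)):
    """Refer the tau most-uncertain fraction of cases to an expert and
    compute accuracy on the retained (most confident) cases.

    uncertainty: [N] predictive uncertainty per case (e.g. entropy)
    correct:     [N] 1 if the model's prediction is correct, else 0
    """
    order = np.argsort(uncertainty)                 # most confident first
    correct = np.asarray(correct, dtype=float)[order]
    n = len(correct)
    accuracies = []
    for tau in taus:
        retained = correct[: max(1, int(round((1.0 - tau) * n)))]
        accuracies.append(retained.mean())          # performance on non-referred cases
    return taus, np.array(accuracies)

# For a model with useful uncertainty, accuracy should rise monotonically with tau.
```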

[6/N] Using selective prediction, we see that model performance is significantly worse under the Country Shift.

[7/N] Another finding is that *there is no single best method*. For example, MFVI (purple) has the strongest selective prediction performance under the Country Shift (right) but the worst when evaluated in-domain (left).

[8/N] Many more experiments in the paper, including:
- Severity Shift: can models adapt to more severe cases than seen before?
- Predictive entropy histograms at each retinopathy severity level, OOD detection, ECE (minimal sketch below), class imbalance and preprocessing ablations.
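For reference, the standard expected calibration error can be sketched in a few lines of NumPy (the bin count is a placeholder):

```python
import numpy as np

def expected_calibration_error(confidences, correct, num_bins=15):
    """Bin predictions by confidence and average the |accuracy - confidence|
    gap per bin, weighted by the fraction of samples in each bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, num_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece
```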

[9/N] Our experiments swept over 400+ hyperparameter configurations using 100+ TPU days and 20+ GPU days (s/o to @google@mastodon.social @IntelLabs for their generous support!).

[11/N] Our codebase will allow you to reproduce experiments (we provide 100+ tuned checkpoints over 6 random seeds) and benchmark your own BDL methods for predictive performance, robustness, and uncertainty quantification (evaluation and plotting).

[12/N] For example, in “Plex: Towards Reliability using Pretrained Large Model Extensions” (@dustinvtran et al.), we evaluate the performance of pretrained Vision Transformers on RETINA. (arxiv.org/abs/2207.07411)


[13/N] Thank you to my co-first author Tim G. J. Rudner (@timrudner), co-authors from OATML and @google@mastodon.social and the many other collaborators who made this work possible!

[N/N] @timrudner @qixuan_feng @filangelos @zacharynado @dusenberrymw @Ghassen_ML @dustinvtran @Yarin