Announcing the public release of the #NeurIPS2021 RETINA Benchmark:
A suite of tasks evaluating the reliability of uncertainty quantification methods like Deep Ensembles, MC Dropout, Parameter- and Function-Space VI, and more.
Paper: https://arxiv.org/abs/2211.12717
Code+Checkpoints: https://rebrand.ly/retina-benchmark
#NewPaper #arxiv #PaperThread below
[0/N]
[1/N] Bayesian Deep Learning has promised to improve neural network reliability on safety-critical applications, such as those in healthcare and autonomous driving.
Yet to holistically assess Bayesian Deep Learning methods, we need benchmarks on real-world tasks that reflect realistic distribution shifts, and strong uncertainty quantification baselines that capture both aleatoric and epistemic uncertainty.
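(A minimal numpy sketch of that aleatoric/epistemic split from Monte Carlo samples; function names are ours, not the benchmark's code. The entropy of the averaged prediction is the total uncertainty, the average per-sample entropy is the aleatoric part, and their difference, the mutual information, is the epistemic part.)

```python
import numpy as np

def uncertainty_decomposition(probs):
    """Split predictive uncertainty into aleatoric and epistemic parts.

    probs: array of shape (num_mc_samples, num_examples) holding p(y=1 | x, theta_s)
           from posterior samples, MC dropout masks, or ensemble members.
    """
    eps = 1e-12
    p_mean = probs.mean(axis=0)  # posterior-averaged prediction per example
    # Total uncertainty: entropy of the averaged prediction.
    total = -(p_mean * np.log(p_mean + eps)
              + (1 - p_mean) * np.log(1 - p_mean + eps))
    # Aleatoric uncertainty: expected entropy of the per-sample predictions.
    aleatoric = -(probs * np.log(probs + eps)
                  + (1 - probs) * np.log(1 - probs + eps)).mean(axis=0)
    # Epistemic uncertainty: mutual information = total - aleatoric (>= 0).
    epistemic = total - aleatoric
    return total, aleatoric, epistemic
```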
[2/N] To this end, we designed RETINA, a suite of real-world tasks assessing the reliability of several established and SoTA Bayesian and non-Bayesian uncertainty quantification methods.
[3/N] We curated two public datasets of high-res human retina images exhibiting varying degrees of diabetic retinopathy, and evaluated methods on an automated diagnosis task (pictured) that requires reliable predictive uncertainty quantification.
[4/N] Our main takeaway: evaluation that ignores uncertainty can be misleading. E.g., on the “Country Shift” task, models are trained on the US-sourced EyePACS dataset and evaluated out-of-domain on the Indian APTOS dataset.
Counterintuitively, when considering ROC curves, methods consistently perform better on the distributionally shifted APTOS data than in-domain (black dot is the NHS-recommended threshold for automated diagnosis).
[5/N] We use “selective prediction” (computed as pictured) to simulate automated diagnosis pipelines: if a model has good uncertainty estimates, its performance p should increase with the proportion of patients 𝛕 referred to a medical expert.
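(A minimal sketch of that referral curve, with our own function names rather than the benchmark's implementation: rank patients by predictive uncertainty, refer the most uncertain fraction 𝛕 to an expert, and score the model on the rest.)

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def referral_curve(y_true, y_prob, uncertainty, taus=np.linspace(0.0, 0.9, 10)):
    """Performance on retained cases as a function of the referral fraction tau.

    y_true: (N,) binary labels; y_prob: (N,) predicted p(y=1|x);
    uncertainty: (N,) per-example uncertainty, e.g. predictive entropy.
    ROC AUC stands in for the performance metric p.
    """
    order = np.argsort(uncertainty)  # most confident cases first
    n = len(y_true)
    curve = []
    for tau in taus:
        # Retain the (1 - tau) most confident cases; refer the rest.
        keep = order[: max(1, int(np.ceil((1.0 - tau) * n)))]
        if len(np.unique(y_true[keep])) < 2:  # AUC needs both classes present
            continue
        curve.append((tau, roc_auc_score(y_true[keep], y_prob[keep])))
    return curve
```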
[6/N] Using selective prediction, we see that model performance is significantly worse under the Country Shift.
[7/N] Another finding is that *there is no single best method*. For example, mean-field variational inference (MFVI, purple) has the strongest selective prediction performance under the Country Shift (right) but the worst when evaluated in-domain (left).
[8/N] Many more experiments in the paper, including:
- Severity Shift: can models generalize to more severe retinopathy cases than those seen during training?
- Predictive entropy histograms at each retinopathy severity level, OOD detection, expected calibration error (ECE, sketched below), and class-imbalance and preprocessing ablations.
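(For reference, ECE here means the standard binned calibration metric; a generic sketch, not the benchmark's exact implementation.)

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, num_bins=15):
    """Binned ECE for a binary classifier: the bin-weighted average gap
    between confidence and accuracy.

    y_true: (N,) binary labels; y_prob: (N,) predicted p(y=1|x).
    """
    confidence = np.maximum(y_prob, 1.0 - y_prob)  # confidence in the predicted class
    correct = ((y_prob >= 0.5).astype(int) == y_true).astype(float)
    bins = np.linspace(0.0, 1.0, num_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidence > lo) & (confidence <= hi)
        if mask.any():
            # Weight each bin's |accuracy - confidence| gap by its population share.
            ece += mask.mean() * abs(correct[mask].mean() - confidence[mask].mean())
    return ece
```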
[9/N] Our experiments swept over 400+ hyperparameter configurations using 100+ TPU days and 20+ GPU days (s/o to @google@mastodon.social @IntelLabs for their generous support!).
[10/N] To enable future research on reliability in safety-critical settings, the RETINA Benchmark is open-sourced as part of Uncertainty Baselines:
https://github.com/google/uncertainty-baselines
[11/N] Our codebase will allow you to reproduce experiments (we provide 100+ tuned checkpoints over 6 random seeds) and benchmark your own BDL methods for predictive performance, robustness, and uncertainty quantification (evaluation and plotting).
[12/N] For example, in “Plex: Towards Reliability using Pretrained Large Model Extensions” (@dustinvtran et al.), we evaluate the performance of pretrained Vision Transformers on RETINA. (https://arxiv.org/abs/2207.07411)
[13/N] Thank you to my co-first author Tim G. J. Rudner (@timrudner), co-authors from OATML and @google@mastodon.social, and the many other collaborators who made this work possible!
[N/N] @timrudner @qixuan_feng @filangelos @zacharynado @dusenberrymw @Ghassen_ML @dustinvtran @Yarin