We curated and analysed thousands of benchmarks -- to better understand the (mis)measurement of AI!
We cover all of #NLProc and #ComputerVision.
Now live at Nature Communications: https://nature.com/articles/s41467-022-34591-0
Benchmarks are crucial to measuring and steering AI progress.
Their number has become astounding.
Each has unique patterns of activity, improvement and eventual stagnation/saturation. Together they form the intricate story of global progress in AI.
We found that a sizable portion of benchmarks has effectively reached saturation ("can't get better than this") or stagnation ("could get better, but we don't know how / nobody tries"). Still, plenty of benchmarks remain dynamic!
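To make that concrete, here is a minimal Python sketch of how a benchmark's SOTA trajectory could be flagged as saturated, stagnant or dynamic. The thresholds and field names are illustrative assumptions, not the paper's exact criteria:

```python
from datetime import date

def classify_benchmark(sota_history, ceiling=100.0,
                       near_ceiling=0.99, stale_years=3, today=None):
    """Toy classifier for a benchmark's state of progress.

    sota_history: list of (date, score) pairs, one per new state-of-the-art
    result, with scores on a 0..ceiling scale (e.g. accuracy in %).
    Thresholds are illustrative, not the paper's exact criteria.
    """
    today = today or date.today()
    history = sorted(sota_history)
    last_improvement = history[-1][0]
    best = max(score for _, score in history)

    if best >= near_ceiling * ceiling:
        return "saturated"   # essentially no headroom left
    if (today - last_improvement).days > stale_years * 365:
        return "stagnant"    # headroom left, but no recent SOTA improvement
    return "dynamic"         # still being actively improved

# e.g. last SOTA jump in 2019, far from the ceiling -> "stagnant"
print(classify_benchmark([(date(2017, 5, 1), 61.0), (date(2019, 3, 1), 64.2)]))
```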
How do benchmark activity and improvement develop over time and across domains? We mapped all the data into an #RDF #KnowledgeGraph / ontology and devised novel, highly condensed visualisation methods.
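Roughly what the graph view enables, as a tiny rdflib sketch. The namespace and property names here are made up for illustration; they are not our actual ontology:

```python
from rdflib import Graph, Literal, Namespace, RDF, XSD

# Illustrative namespace and properties only; not the ontology from the paper.
EX = Namespace("http://example.org/ai-benchmarks#")

g = Graph()
g.bind("ex", EX)

# One benchmark result: a model scoring 90.1 on SQuAD in 2019.
result = EX["result/squad-xlnet-2019"]
g.add((result, RDF.type, EX.BenchmarkResult))
g.add((result, EX.onBenchmark, EX["benchmark/SQuAD"]))
g.add((result, EX.byModel, Literal("XLNet")))
g.add((result, EX.score, Literal(90.1, datatype=XSD.double)))
g.add((result, EX.year, Literal(2019)))

# SPARQL lets us ask cross-benchmark questions, e.g. best score per benchmark.
q = """
PREFIX ex: <http://example.org/ai-benchmarks#>
SELECT ?benchmark (MAX(?score) AS ?best)
WHERE {
  ?r a ex:BenchmarkResult ;
     ex:onBenchmark ?benchmark ;
     ex:score ?score .
}
GROUP BY ?benchmark
"""
for row in g.query(q):
    print(row.benchmark, row.best)
```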
Most benchmark datasets are unpopular.
Traits correlated with popularity (see the sketch after this list):
- versatility (covering more tasks, having more sub-benchmarks)
- having a dedicated leaderboard
- being created by people from top institutions
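A rough sketch of how such trait/popularity associations could be checked. The column names and the use of Spearman rank correlation are illustrative assumptions, not the paper's exact methodology:

```python
import pandas as pd
from scipy.stats import spearmanr

# Hypothetical per-benchmark table; columns and values are illustrative only.
df = pd.DataFrame({
    "n_results":       [120, 4, 35, 2, 310],   # proxy for popularity
    "n_tasks":         [3, 1, 2, 1, 5],        # versatility
    "has_leaderboard": [1, 0, 1, 0, 1],
    "top_institution": [1, 0, 0, 0, 1],
})

for trait in ["n_tasks", "has_leaderboard", "top_institution"]:
    rho, p = spearmanr(df[trait], df["n_results"])
    print(f"{trait}: Spearman rho={rho:.2f} (p={p:.3f})")
```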
The biggest obstacle and limitation for our work is data availability.
This analysis was only possible thanks to data from the fabulous Papers with Code project (shoutout to @rstojnic )
As a community, we should do more to incentivize depositing results in Papers with Code. There's a lot of potential added value!