Did you know:
Evaluating a single model on HELM took 4K GPU hours or
over $10K in API calls?!
Flash-HELM can reduce costs by up to ×200!
https://arxiv.org/abs/2308.11696
What we all care about most is whether our son/team/model finished last or second to last.
Oh, you don't? What about first or second?
For worse models, we don't need the same resolution,
just like in sports
Flash-HELM is a tournament (rough sketch in code below):
It evaluates on a few examples.
If the model doesn't qualify for the top X, it stops.
Otherwise, it gets more evaluation resources.
Bad models take ×200 less compute,
good models are ranked more reliably.
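In code it looks roughly like this (a minimal sketch with made-up names, schedules, and an assumed `evaluate` callable; not the paper's actual implementation):

```python
import random

def flash_tournament(models, examples, evaluate, top_k_schedule, sample_schedule):
    """Tournament-style evaluation sketch (hypothetical, not Flash-HELM's exact API).

    models: list of model identifiers
    examples: full pool of benchmark examples
    evaluate: callable (model, batch) -> score on that batch
    top_k_schedule: survivors per round, e.g. [32, 8, 3]
    sample_schedule: examples evaluated per round, e.g. [10, 100, 1000]
    """
    survivors = list(models)
    scores = {m: 0.0 for m in models}
    spent = 0
    for top_k, n in zip(top_k_schedule, sample_schedule):
        batch = random.sample(examples, n)  # only a few examples this round
        for m in survivors:
            scores[m] += evaluate(m, batch)  # cheap partial evaluation
        spent += n * len(survivors)
        # models that don't qualify for the top-k stop here;
        # the rest get more evaluation resources next round
        survivors = sorted(survivors, key=scores.get, reverse=True)[:top_k]
    print(f"evaluations spent: {spent} vs. {len(models) * len(examples)} for full eval")
    return survivors
```

Bad models exit after the first cheap round, so almost all compute goes to ranking the top contenders.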
The resulting rankings closely match full evaluation (on the diagonal)
Wondering how to choose which resources to reduce,
test reliability, and build more informed and efficient benchmarks?
Read the sibling thread:
I just had to give the computation angle after all the recent discussions...
https://arxiv.org/abs/2308.11696
https://sigmoid.social/@LChoshen/110967533979377734
P.S. We will collaborate to add it into HELM
Related work on summarization and human annotation (and more)