Leshem Choshen

Did you know:
Evaluating a single model on HELM took
⏱️ 4K GPU hours or 💸 over $10K in API calls?!
Flash-HELM ⚡️ can reduce costs by x200!
arxiv.org/abs/2308.11696
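
For scale, here is a back-of-the-envelope calculation of what a x200 reduction means, taking the 4K GPU-hour and $10K figures above at face value (the exact costs vary by model and setup):

gpu_hours_full = 4_000       # reported GPU hours for one full HELM evaluation
api_cost_full_usd = 10_000   # reported API cost in USD (a lower bound)
reduction = 200              # claimed cost-reduction factor

print(gpu_hours_full / reduction)     # 20.0 GPU hours
print(api_cost_full_usd / reduction)  # 50.0 USD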

What we all care about most is whether our son/team/model finished last or second to last.
Oh, you don’t? What about first or second?

For worse models, we don’t need the same resolution.
Just like in sports.

Flash-HELM is a tournament:
It evaluates each model on a few examples.
If the model does not qualify for the top X, it stops.
Otherwise, it gets more evaluation resources.
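
In code, the idea looks roughly like this. This is a minimal sketch of a tournament with early stopping, not the actual Flash-HELM implementation; evaluate_on, the round sizes, and keep_top are hypothetical placeholders:

import random

def evaluate_on(model, examples):
    # Average score of `model` on a list of examples (stubbed scorer;
    # here a model is any callable mapping an example to a score).
    return sum(model(x) for x in examples) / len(examples)

def flash_tournament(models, examples, rounds=(50, 200, 1000), keep_top=4):
    # Evaluate all models on a small sample first; after each round,
    # only the current top-`keep_top` models advance to a larger sample.
    survivors = list(models)
    scores = {}
    for budget in rounds:
        sample = random.sample(examples, min(budget, len(examples)))
        scores = {m: evaluate_on(m, sample) for m in survivors}
        # Models that do not qualify for the top `keep_top` stop here;
        # the rest receive more evaluation resources in the next round.
        survivors = sorted(survivors, key=scores.get, reverse=True)[:keep_top]
    return scores, survivors

A weak model is eliminated after the first, cheapest round, which is where the bulk of the savings comes from.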

Bad models take x200 less compute.
Good ones are ranked more reliably.
All scores are similar to those of the full evaluation (points on the diagonal).

Wondering how to choose which resources to reduce,
test reliability, and build more informed and efficient benchmarks?
Read the sibling 🧵
I just had to give the computation angle after all the recent discussions...
arxiv.org/abs/2308.11696

sigmoid.social/@LChoshen/11096
P.S. We will collaborate to add it into HELM.

arXiv.org: Efficient Benchmarking of Language Models
The increasing versatility of language models (LMs) has given rise to a new class of benchmarks that comprehensively assess a broad range of capabilities. Such benchmarks are associated with massive computational costs, extending to thousands of GPU hours per model. However, the efficiency aspect of these evaluation efforts had raised little discussion in the literature. In this work, we present the problem of Efficient Benchmarking, namely, intelligently reducing the computation costs of LM evaluation without compromising reliability. Using the HELM benchmark as a test case, we investigate how different benchmark design choices affect the computation-reliability trade-off. We propose to evaluate the reliability of such decisions, by using a new measure -- Decision Impact on Reliability, DIoR for short. We find, for example, that a benchmark leader may change by merely removing a low-ranked model from the benchmark, and observe that a correct benchmark ranking can be obtained by considering only a fraction of the evaluation examples. Based on our findings, we outline a set of concrete recommendations for efficient benchmark design and utilization practices. To take a step further, we use our findings to propose an evaluation algorithm, that, when applied to the HELM benchmark, leads to dramatic cost savings with minimal loss of benchmark reliability, often reducing computation by x100 or more.

Related work on summarization and human annotation (and more)

sigmoid.social/@LChoshen/11202