New results on efficient evaluation
1. A few examples are enough for a clear human preference to emerge; automatic metrics don't need many either (see the sketch after the links below)
2. Context may change which model is preferred
https://arxiv.org/abs/2402.18756
#evaluation #nlp #nlproc #ML #summarization #efival
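To make the first finding concrete, here is a minimal sketch (my own illustration, not the paper's code or data) of how one might test whether k examples suffice: subsample k pairwise human judgments from a synthetic preference set and check how often the winning model matches the full-set winner.

```python
# Hypothetical sketch: how many pairwise judgments until the preferred
# model is stable? All data here is synthetic, for illustration only.
import random

def winner(prefs):
    """prefs: list of +1 (model A wins) / -1 (model B wins) judgments."""
    return 1 if sum(prefs) >= 0 else -1

def stability(prefs, k, trials=1000, seed=0):
    """Fraction of size-k subsamples whose winner agrees with the full set."""
    rng = random.Random(seed)
    full = winner(prefs)
    agree = sum(winner(rng.sample(prefs, k)) == full for _ in range(trials))
    return agree / trials

# Synthetic judgments: model A preferred on ~65% of 500 examples.
rng = random.Random(42)
prefs = [1 if rng.random() < 0.65 else -1 for _ in range(500)]

for k in (5, 10, 25, 50, 100):
    print(f"k={k:3d}  agreement with full-set winner: {stability(prefs, k):.2f}")
```

With a preference gap this size, agreement climbs toward 1.0 at a few dozen judgments, which is the kind of redundancy the paper points at.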
The finding that we use redundantly many examples (for human evaluation) complements several recent works showing the same for benchmarks:
Finding better ways to choose examples, so deciding between a pair of models takes fewer comparisons:
https://sigmoid.social/@LChoshen/111924841098749429
& human-side efficiency (e.g., low-resource annotation and active learning)
I am sure there is also related work on context dependency and other threads I don't know about; please share (sorry for only mentioning my own work, know of others?)