I am creating an LLM benchmark series, "PE-Bench-Python-100", to measure how well LLMs can solve the Project Euler challenges compared to humans. The series will also cover languages other than Python, such as Java.
The first result is already interesting: even the small model llama3.2 (3B) shows two-fold super-human performance (score: 2.14)
Source: https://github.com/Orbiter/project-euler-llm-benchmark/
@orbiterlab I considered building a similar thing for Advent of Code, but I think for both there are just a ton of solutions in the training data, so you are probably testing memorization rather than actual skill.
I'd also consider async/parallel/batch processing; afaik ollama/llama.cpp has supported it since 0.2. I'd assume you could get at least 10x throughput if it's well optimized.
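Something like this rough sketch is what I mean, assuming the default Ollama HTTP endpoint, a placeholder model tag, and a server configured for parallel requests (e.g. OLLAMA_NUM_PARALLEL > 1):

```python
# Rough sketch (not the benchmark's actual code): fan several Project Euler
# prompts out to a local Ollama server in parallel.
from concurrent.futures import ThreadPoolExecutor

import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint
MODEL = "llama3.2"  # placeholder model tag

def solve(prompt: str) -> str:
    # single blocking request; parallelism comes from the thread pool below
    resp = requests.post(
        OLLAMA_URL,
        json={"model": MODEL, "prompt": prompt, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["response"]

# hypothetical prompts, one per problem
prompts = [f"Solve Project Euler problem {n} in Python." for n in range(1, 11)]

with ThreadPoolExecutor(max_workers=4) as pool:
    answers = list(pool.map(solve, prompts))
```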
@rnbwdsh yes, Project Euler has been around for well over 10 years and started with ca. 160 problems. I also believe that most LLMs have seen solutions, though probably mostly in Python. I don't think this carries over too strongly to other programming languages, so benchmarks in Java etc. may be more meaningful.
My goal is to evaluate how well LLMs can perform inside an auto-coder engine, and there I would not mind if those models perform well because of some 'cheating'.
@orbiterlab From my experience with Nim (a niche language), most LLMs generalize reasonably well beyond what they have memorized, especially with a CoT prompt where you ask the model to solve the problem in a better-represented language first.
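Roughly like this (just a sketch; the wording is made up and not taken from any real template):

```python
# Illustrative two-step CoT prompt: solve in a well-covered language first,
# then translate into the target (niche) language.
COT_TEMPLATE = """First write a correct Python solution for the Project Euler
problem below and explain it briefly. Then translate that solution into
idiomatic Nim. Output only the final Nim program in a single code block.

Problem:
{problem_statement}
"""

prompt = COT_TEMPLATE.format(
    problem_statement="Find the sum of all multiples of 3 or 5 below 1000."
)
```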
@rnbwdsh the current prompt deliberately does not use any CoT, because I also want to measure how much different prompt strategies influence the result, so this comes later. Templates are here: https://github.com/Orbiter/project-euler-llm-benchmark/blob/main/templates/template_python.md
@orbiterlab And I had a veeery similar idea for an auto-coder since I saw the NeurIPS AutoGen 0.4 workshop. It's kind of funny.
Let me guess: you also had the idea to do some "ablation" study (no CoT/deep think, no/simpler system prompt...) :D
@rnbwdsh ah yes, I just replied to that. Your ideas are welcome if you want to contribute.
However, my goal is to use the test results in a follow-up project where I want to build an auto-coder that reads tickets and provides pull requests. So we can collect best practices here.
@orbiterlab but I'm very curious about a full results table for 10+ models. I'm rooting for DeepSeek/QwQ.
@rnbwdsh my computer is already cooking hard, and I want to build a large table. Currently the largest models are first in the queue.
I also want to measure the influence of different quantization levels, because at this point I doubt that a heavily quantized large model beats a smaller one with lighter quantization. I don't know of any benchmark that shows this.
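A first comparison could look roughly like this (the model tags are only illustrative examples; the actual Ollama tags may differ):

```python
# Illustrative run matrix for the quantization question: a large model at
# aggressive quantization vs. smaller models at lighter quantization.
RUNS = [
    "llama3.1:70b-instruct-q2_K",    # big model, strong quantization
    "llama3.1:70b-instruct-q4_K_M",  # big model, moderate quantization
    "llama3.1:8b-instruct-q8_0",     # small model, light quantization
    "llama3.2:3b-instruct-fp16",     # tiny model, no quantization
]
# Each tag would be run against the same PE-Bench prompts and scored the same
# way, so the score differences isolate the effect of size vs. quantization.
```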