I am creating an LLM benchmark series, "PE-Bench-Python-100", to measure how well LLMs can solve the Project Euler challenges compared to humans. The series will also cover languages other than Python, such as Java.
The first result is already interesting: even the small model llama3.2 (3B) shows two-fold super-human performance (score: 2.14)
Source: https://github.com/Orbiter/project-euler-llm-benchmark/
@orbiterlab I considered building a similar thing for Advent of Code, but I think for both there are just a ton of solutions in the training data, so you are probably testing memorization rather than actual skill.
I'd also consider async/parallel/batch processing; afaik ollama/llama.cpp has supported it since 0.2. I'd assume you could get at least 10x throughput if it's well optimized.
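Something like this rough sketch is what I mean, assuming the default Ollama HTTP endpoint, a placeholder model tag, and a server configured for parallel requests (e.g. OLLAMA_NUM_PARALLEL > 1):

```python
# Rough sketch (not the benchmark's actual code): fan several Project Euler
# prompts out to a local Ollama server in parallel.
from concurrent.futures import ThreadPoolExecutor

import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint
MODEL = "llama3.2"  # placeholder model tag

def solve(prompt: str) -> str:
    # single blocking request; parallelism comes from the thread pool below
    resp = requests.post(
        OLLAMA_URL,
        json={"model": MODEL, "prompt": prompt, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["response"]

# hypothetical prompts, one per problem
prompts = [f"Solve Project Euler problem {n} in Python." for n in range(1, 11)]

with ThreadPoolExecutor(max_workers=4) as pool:
    answers = list(pool.map(solve, prompts))
```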
@rnbwdsh yes, Project Euler has been around for well over 10 years and started with ca. 160 problems. I also believe that most LLMs have seen solutions, though probably mostly in Python. I don't think this carries over too strongly to other programming languages, so benchmarks in Java etc. may be more meaningful.
My goal is to evaluate how well LLMs can perform inside an auto-coder engine, and there I would not mind if those models perform well because of some 'cheating'.
@orbiterlab From my experience with Nim (a niche language), most LLMs generalize reasonably well beyond what they have memorized, especially with a CoT prompt where you ask the model to solve the problem in a better-represented language first.
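Roughly like this (just a sketch; the wording is made up and not taken from any real template):

```python
# Illustrative two-step CoT prompt: solve in a well-covered language first,
# then translate into the target (niche) language.
COT_TEMPLATE = """First write a correct Python solution for the Project Euler
problem below and explain it briefly. Then translate that solution into
idiomatic Nim. Output only the final Nim program in a single code block.

Problem:
{problem_statement}
"""

prompt = COT_TEMPLATE.format(
    problem_statement="Find the sum of all multiples of 3 or 5 below 1000."
)
```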
@rnbwdsh the current prompt deliberately does not use any CoT, because I also want to measure how much different prompt strategies influence the result, so this comes later. Templates are here: https://github.com/Orbiter/project-euler-llm-benchmark/blob/main/templates/template_python.md
@orbiterlab And I had a veeery similar idea for an auto-coder since I saw the NeurIPS AutoGen 0.4 workshop. It's kind of funny.
Let me guess: you also had the idea to do some "ablation" study (no CoT/deep think, no/simpler system prompt...) :D
@rnbwdsh ah yes, I just replied to that. Your ideas are welcome if you want to contribute.
However, my goal is to use the test results in a follow-up project where I want to build an auto-coder that reads tickets and provides pull requests. So we can collect best practices here.
@orbiterlab but I'm very curious about a full results table for 10+ models. I'm rooting for DeepSeek/QwQ.
@rnbwdsh my computer is already cooking hard, and I want to build a large table. Currently the largest models are first in the queue.
I also want to measure the influence of different quantization levels, because at this point I doubt that a heavily quantized large model beats a smaller one with lighter quantization. I don't know of any benchmark that shows this.
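A first comparison could look roughly like this (the model tags are only illustrative examples; the actual Ollama tags may differ):

```python
# Illustrative run matrix for the quantization question: a large model at
# aggressive quantization vs. smaller models at lighter quantization.
RUNS = [
    "llama3.1:70b-instruct-q2_K",    # big model, strong quantization
    "llama3.1:70b-instruct-q4_K_M",  # big model, moderate quantization
    "llama3.1:8b-instruct-q8_0",     # small model, light quantization
    "llama3.2:3b-instruct-fp16",     # tiny model, no quantization
]
# Each tag would be run against the same PE-Bench prompts and scored the same
# way, so the score differences isolate the effect of size vs. quantization.
```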