# AoE 2 LLM Benchmarks
## Purpose

This test is technically a coding optimization problem, which I feel represents pretty well what we want our agentic coders to do. It is, however, kind of a weird test, weird enough that it's likely quite out of distribution.

## What It Tests
- Instruction following: honor the exact benchmark prompt and emit the required scoring lines (see the sketch after this list).
- Needle in a haystack: the JSON is 1000 lines and some elements are critical.
- Context rot: several models iterate over their full context, which compounds the above problem.
- Out-of-distribution coding: the DSL syntax is strict and original.
- Strategic thinking: it's not 100% trivial to achieve all 4 objectives and have a good build order.
- World knowledge: some models show obvious existing knowledge of typical AoE2 build orders.
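To make the scoring concrete, here is a minimal sketch of how a run could be checked against the four objectives. Everything in it is an assumption for illustration: the marker strings, the log format, and the `score_run` helper are invented, and the benchmark's actual required scoring lines are not reproduced here.

```python
# Hypothetical scorer sketch. The marker strings and log format below are
# invented for illustration; the real benchmark defines its own scoring lines.
OBJECTIVES = {
    "Feudal": "AGE_UP feudal",
    "Castle": "AGE_UP castle",
    "10 Archers": "UNIT_COUNT archer 10",
    "Fletching": "TECH_RESEARCHED fletching",
}

def score_run(log_lines: list[str]) -> dict[str, bool]:
    """Mark each objective as achieved if its scoring line appears in the log."""
    return {
        name: any(marker in line for line in log_lines)
        for name, marker in OBJECTIVES.items()
    }

if __name__ == "__main__":
    sample = ["AGE_UP feudal @ 09:12", "TECH_RESEARCHED fletching @ 11:40"]
    print(score_run(sample))
    # {'Feudal': True, 'Castle': False, '10 Archers': False, 'Fletching': True}
```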
## Disclaimer
For cost reasons, I haven't rerun these benchmarks many times. I don't think they're _that_ high-variance, to be honest, but it's worth keeping in mind.
## Results
These results are overall not great: Gemini 3.1 Pro is the best, and even it is easily below par. However, every model I could run did successfully write a DSL script.

To be fair, the LLMs have a slightly worse interface to work with, but I would still expect better.

What I find interesting is the very clear skill divide on display, even though all these models are "good at code".
Partly inspired by the brilliant minebench.ai.
| # | Model + Harness | Feudal | Castle | 10 Archers | Fletching | Grade | Cost | Trouble | Build |
|---|---|---|---|---|---|---|---|---|---|