AoE 2 LLM Benchmarks

How good are LLMs at crafting AoE 2 build orders?

Purpose

This benchmark compares how well different LLM+harness setups can produce a competitive AoE2 build-order DSL file under similar constraints.

It is designed to test practical build-order quality, not just syntax correctness. A model needs to hit the timing goals quickly, keep eco healthy, and avoid brittle scripts that collapse under simulation constraints.

What It Tests

This test is technically a coding optimization problem, which I feel represents pretty well what we want our agentic coders to do. It is, however, kind of a weird test, weird enough that it's likely quite out of distribution.

Instruction following: honor the exact benchmark prompt and required scoring lines.
Needle in a haystack: the JSON is 1000 lines and some elements are critical.
Context rot: several models iterated through their full context. This compounds the problem above.
Out-of-distribution coding: the DSL syntax is strict and original.
Strategic thinking: it's not trivial to achieve all 4 objectives and still have a good build order.
World knowledge: some models show obvious existing knowledge of typical AoE2 build orders.
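
To make the "all 4 objectives" point concrete, here is a minimal Python sketch of how one simulated run could be summarized. This is not the benchmark's actual evaluator: the objective keys, the `summarize` helper, and the seconds-based input format are all assumptions made for illustration. Only the four objective names come from the results table.

```python
from typing import Optional

# The four objectives named in the results table.
# The key spellings are hypothetical identifiers, not the benchmark's own.
OBJECTIVES = ("feudal", "castle", "10_archers", "fletching")


def summarize(times: dict[str, Optional[float]]) -> dict:
    """Summarize one simulated run.

    `times` maps each objective to the game time (in seconds) at which it
    was reached, or None if the build order never reached it.
    """
    # An objective counts as missing if it is absent or explicitly None.
    missing = [name for name in OBJECTIVES if times.get(name) is None]
    reached = {
        name: times[name] for name in OBJECTIVES if times.get(name) is not None
    }
    return {
        "all_reached": not missing,
        "missing": missing,
        # Time at which the last reached objective completed, if any.
        "last_objective_at": max(reached.values()) if reached else None,
    }
```

A run that reaches everything except Fletching would come back with `all_reached` set to `False` and `missing` set to `["fletching"]`, which is the kind of partial success a syntax-only check would not reveal.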

Disclaimer

For cost reasons, I haven't rerun these benchmarks many times. I don't think they're _that_ high-variance, to be honest, but it's worth keeping in mind.

Results

These results are overall not great - Gemini 3.1 Pro is the best, and it's easily below par. However, all models that I could run did successfully write a DSL script.
To be fair - the LLMs have a slightly worse interface to work with, but I would expect better.
What I find interesting is that there is a very clear skill divide on display, even though all these models are supposedly "good at code".
Partly inspired by the brilliant minebench.ai.

#	Model + Harness	Feudal	Castle	10 Archers	Fletching	Grade	Cost	Trouble	Build

Reproduce

Benchmark assets live in benchmarks/aoe2-llm/: prompt.txt, eval.

I ran every model with the same constraints and access, but pi has some issues with a few of them.

Prompt
