AoE 2 LLM Benchmarks

How good are LLMs at crafting AoE 2 build orders?


Purpose

This benchmark compares how well different LLM+harness setups can produce a competitive AoE2 build-order DSL file under similar constraints.

It is designed to test practical build-order quality, not just syntax correctness. A model needs to hit the timing goals quickly, keep eco healthy, and avoid brittle scripts that collapse under simulation constraints.
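To make "hitting the timing goals" concrete, here is a minimal sketch of how milestone timings could be folded into a single score. It is not the benchmark's actual grading code: the par times, the linear decay, and the names `PAR_SECONDS`, `milestone_score`, and `grade` are all assumptions for illustration. Only the milestone names (Feudal, Castle, 10 Archers, Fletching) come from the results table.

```python
from typing import Dict, Optional

# Hypothetical par times in seconds. These are illustrative placeholders,
# NOT the benchmark's real targets.
PAR_SECONDS: Dict[str, float] = {
    "feudal": 10 * 60,
    "castle": 20 * 60,
    "10_archers": 14 * 60,
    "fletching": 13 * 60,
}


def milestone_score(reached_at: Optional[float], par: float) -> float:
    """Return 1.0 at or under par, decaying linearly to 0.0 at twice par."""
    if reached_at is None:
        # The build order never reached this milestone in the simulation.
        return 0.0
    if reached_at <= par:
        return 1.0
    return max(0.0, 1.0 - (reached_at - par) / par)


def grade(times: Dict[str, Optional[float]]) -> float:
    """Average the per-milestone scores into a single value between 0 and 1."""
    scores = [
        milestone_score(times.get(name), par)
        for name, par in PAR_SECONDS.items()
    ]
    return sum(scores) / len(scores)
```

Under these assumptions, a build that misses a milestone entirely scores zero for it, so a brittle script that collapses mid-simulation is penalized more than one that is merely slow.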

What It Tests

This test is technically a coding optimization problem, which I feel represents pretty well what we want our agentic coders to do. It is, however, kind of a weird test, weird enough that it's likely quite out of distribution.

Disclaimer

For cost reasons, I haven't rerun these benchmarks many times. I don't think they're _that_ high-variance, to be honest, but it's worth keeping in mind.

Results

These results are overall not great - Gemini 3.1 Pro is the best, and it's easily below par. However, all models that I could run did successfully write a DSL script.
To be fair - the LLMs have a slightly worse interface to work with, but I would expect better.
What I find interesting is that there is a very clear skill divide on display, when all these models are "good at code".
Partly inspired by the brilliant minebench.ai.

# | Model + Harness | Feudal | Castle | 10 Archers | Fletching | Grade | Cost | Trouble | Build

Reproduce

Benchmark assets live in benchmarks/aoe2-llm/: prompt.txt, eval.

I ran each model with the same constraints and access, but pi has some issues with a few of them.

Prompt
