AoE 2 LLM Benchmarks

How good are LLMs at crafting AoE 2 build orders?


Purpose

This benchmark compares how well different LLM+harness setups can produce a competitive AoE2 build-order DSL file under similar constraints.

What It Tests

This test is technically a coding optimization problem, which I feel represents fairly well what we want our agentic coders to do. It is, however, a weird test, weird enough that it's likely quite out of distribution.

Setup

I give the model a simple prompt (see the bottom of the page), JSON describing the game data, a grammar reference, and a mostly blank build order.
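As a rough sketch of what each model receives, the setup can be pictured like this (all contents, names, and the JSON shape below are hypothetical stand-ins, not the actual benchmark assets, which live in the repo linked under Reproduce):

```python
import json

# Hypothetical stand-ins for the four inputs described above;
# the real prompt, game data, grammar, and seed build order are
# the benchmark assets, not these placeholders.
prompt = "Write a competitive AoE2 build order in the DSL described below."
game_data = json.dumps({"villager": {"cost": {"food": 50}}})  # illustrative shape only
grammar = "# build_order := step+ ; step := action target count"  # placeholder
seed_build = "# mostly blank build order\n"

# Every model/harness gets the same concatenated context under the
# same constraints.
context = "\n\n".join([prompt, game_data, grammar, seed_build])
```

The point is only that each setup sees identical inputs; harnesses differ in how they let the model iterate on the file afterwards.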

Disclaimer

For cost reasons, I haven't rerun these benchmarks many times. I don't think they're _that_ high-variance, to be honest, but it's worth keeping in mind.

Results

These results are overall not amazing. However, every model I could run did successfully write a DSL script.
To be fair, the LLMs have a slightly worse interface to work with, but I would still expect better.
What I find interesting is the very clear skill divide on display, even though all of these models are "good at code".
Opus showed obvious world knowledge: its first draft was very good conceptually, too good to just be random.
Codex 5.3 exhibited severe laziness with the default prompt, always stopping quite early, which makes for high variance. On the other hand, I could then easily guide it to a much better result, because it had context left.
Partly inspired by the brilliant minebench.ai.

[Interactive results table: columns are #, Model + Harness, Feudal, Castle, 10 Archers, Fletching, Grade, Cost, Trouble, and Build.]

Reproduce

This whole website is GPL3. Benchmark assets live here.

I ran every model with the same constraints and access, though pi has some issues with a few of them.

Prompt
