Why this exists

Existing LLM benchmarks such as SWE-bench, HumanEval, and LiveCodeBench measure single-agent code generation. Software engineering is moving in the other direction: one orchestrator model coordinates many specialized subagents that read, write, run, and observe code on its behalf. Skill at that job is not the same as skill at one-shot completion, and nothing public measures it well.

benchburner closes that gap with a single, opinionated test: direct a team of subagents to play an open-source idle game with real economic dynamics. The orchestrator can't touch the code it ships, can't see the game directly, and can't read the wiki. All it has is what the team reports back.

How a run works

The harness boots a pinned Bitburner fork at a fixed RNG seed and a fixed commit.
The orchestrator wakes every polling_interval_seconds (default 60). It receives a snapshot of subagent statuses plus the last few hours of game state.
It emits a JSON action list: spawn, kill, instruct, or noop.
Subagents run a bounded write→run→observe loop and commit final code.
The harness executes that code in-game, measures money, and reports.
At T+24h the bus freezes. Final state is committed to orchestrator/<model>.

┌──────────────┐  instructions  ┌──────────────┐
│ Orchestrator │ ─────────────> │ Subagent Pool│
│  (1 model)   │ <───────────── │  (N models)  │
└──────┬───────┘    results     └──────┬───────┘
       │                               │
       │ snapshots                     │ committed code
       ▼                               ▼
┌──────────────────────────────────────────────┐
│    Bitburner (headless, pinned, seed=XYZ)    │
└──────────────┬───────────────────────────────┘
               ▼
    SQLite + JSON artifacts
               │
               ▼
     orchestrator/ branch
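
The action-list step can be sketched as a small validator. This is a minimal illustration, not the harness's actual code: the four verbs (spawn, kill, instruct, noop) come from the description above, while the "action" and "target" field names and the parse_action_list helper are assumptions made for the example.

```python
import json

# The four verbs come from the text; the "action" and "target" field names
# are assumptions made for this sketch, not the harness's real schema.
ALLOWED_ACTIONS = {"spawn", "kill", "instruct", "noop"}


def parse_action_list(raw: str) -> list[dict]:
    """Parse an orchestrator reply and reject anything outside the four verbs."""
    actions = json.loads(raw)
    if not isinstance(actions, list):
        raise ValueError("expected a JSON list of actions")
    for index, action in enumerate(actions):
        if not isinstance(action, dict):
            raise ValueError(f"action {index} is not a JSON object")
        verb = action.get("action")
        if verb not in ALLOWED_ACTIONS:
            raise ValueError(f"action {index} has unknown verb {verb!r}")
    return actions


# Hypothetical reply from an orchestrator.
reply = '[{"action": "spawn", "target": "worker-1"}, {"action": "noop"}]'
print([a["action"] for a in parse_action_list(reply)])  # ['spawn', 'noop']
```

Rejecting unknown verbs at the boundary keeps the orchestrator limited to structured messages: anything it emits either maps to one of the four actions or is refused.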

What's measured

Primary score: total in-game money at T+24h. The economy spans many orders of magnitude, which spreads orchestrators across a wide log-scale range and makes ties unlikely.
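
To see why a log scale is the natural way to read this score, here is a small sketch. The balances are made up purely for illustration and are not benchmark results; the log_score helper and the orchestrator names are hypothetical.

```python
import math

# Made-up final balances for illustration only; these are NOT benchmark results.
final_money = {
    "orchestrator-a": 3.2e6,
    "orchestrator-b": 4.5e9,
    "orchestrator-c": 7.1e12,
}


def log_score(money: float) -> float:
    """Return log10 of the final balance; balances below $1 clamp to 0."""
    return math.log10(max(money, 1.0))


# Rank by raw money (the primary score); print the log10 view alongside it.
ranked = sorted(final_money, key=final_money.get, reverse=True)
for name in ranked:
    print(f"{name}: ${final_money[name]:.3g} (log10 = {log_score(final_money[name]):.2f})")
```

Ranking uses the raw money total, as the primary score specifies. The log10 column is only a reading aid: a gap of 3 on that scale means a thousandfold difference in money.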

Secondary signals (observed, not ranked): BitNodes completed, augments installed, time distribution across BitNodes, and qualitative emergent strategies visible in the delegation transcript.

What's pinned

Game state. Bitburner is forked and locked to a commit recorded in BITBURNER_COMMIT.
RNG seed. Pinned per cycle. Stored in the harness, never shown to the orchestrator.
Subagent roster. Every orchestrator in a cycle picks from the same curated pool of subagent models.
Prompt. System prompt is identical, byte-for-byte, across all orchestrators in a cycle.
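
One way to check the byte-for-byte prompt guarantee is to compare digests of the raw prompt bytes. The sketch below assumes the prompts are available as bytes keyed by orchestrator name; the helper names and sample prompts are hypothetical, and the harness may verify this differently.

```python
import hashlib


def prompt_digest(prompt: bytes) -> str:
    """SHA-256 over the raw bytes, so whitespace and encoding differences count."""
    return hashlib.sha256(prompt).hexdigest()


def prompts_identical(prompts: dict[str, bytes]) -> bool:
    """True when every orchestrator received exactly the same bytes."""
    return len({prompt_digest(p) for p in prompts.values()}) <= 1


# Hypothetical prompts: the third differs only by a trailing newline.
same = {"model-x": b"You direct a team.", "model-y": b"You direct a team."}
drifted = dict(same, **{"model-z": b"You direct a team.\n"})
print(prompts_identical(same))     # True
print(prompts_identical(drifted))  # False
```

Hashing bytes instead of comparing decoded strings catches differences that are invisible in a text editor, such as a stray trailing newline.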

Methodology choices

Why not expose the seed?

Telling the orchestrator the run is deterministic risks measuring seed-specific overfitting instead of general orchestration. The seed is pinned for reproducibility but the orchestrator is told nothing about it.

Why forbid wiki access?

Bitburner has extensive public strategy guides. If subagents could retrieve them, we'd be measuring retrieval skill, not reasoning or orchestration.

Why batch and not live?

Minimum attack surface, maximum reproducibility. Every result is a git artifact. Anyone can re-run a branch and check the numbers.

Why anonymous submissions?

Some labs want to evaluate models pre-release. Submissions tagged attribution: "anonymous" render as "Submission A", "Submission B", etc. They're ranked alongside attributed entries.
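
The labelling rule could look roughly like this. The attribution: "anonymous" tag and the "Submission A" naming come from the text above. The "model" field, the display_names helper, and assigning letters in list order are assumptions for the sketch.

```python
import string


def display_names(submissions: list[dict]) -> list[str]:
    """Replace anonymous model ids with "Submission A", "Submission B", ...

    Letters are handed out in list order (an assumption). This sketch stops
    at 26 anonymous entries: a 27th would raise StopIteration.
    """
    labels = iter(string.ascii_uppercase)
    names = []
    for sub in submissions:
        if sub.get("attribution") == "anonymous":
            names.append(f"Submission {next(labels)}")
        else:
            names.append(sub["model"])
    return names


# Hypothetical leaderboard rows; the "model" field name is an assumption.
rows = [
    {"model": "lab-1/secret", "attribution": "anonymous"},
    {"model": "open-model-7b"},
    {"model": "lab-2/secret", "attribution": "anonymous"},
]
print(display_names(rows))  # ['Submission A', 'open-model-7b', 'Submission B']
```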

What each run produces

summary.json — final stats, model id, status.
delegations.json — every instruction and every result.
scripts.json — all subagent-generated Netscript.
snapshots.json — hourly game-state captures.
state.db — SQLite source of truth.
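
A quick completeness check over a run's output might look like the sketch below. Only the five file names come from the list above. The flat directory layout and the missing_artifacts helper are assumptions, and no file contents are inspected because their schemas are not described here.

```python
import tempfile
from pathlib import Path

# File names taken from the list above; the flat layout is an assumption.
EXPECTED_ARTIFACTS = (
    "summary.json",
    "delegations.json",
    "scripts.json",
    "snapshots.json",
    "state.db",
)


def missing_artifacts(run_dir: Path) -> list[str]:
    """Return the expected artifact names that are absent from run_dir."""
    return [name for name in EXPECTED_ARTIFACTS if not (run_dir / name).is_file()]


# Demo against a throwaway directory that holds only summary.json.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    (demo_dir / "summary.json").write_text("{}")
    demo_missing = missing_artifacts(demo_dir)

print(demo_missing)  # ['delegations.json', 'scripts.json', 'snapshots.json', 'state.db']
```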

Submitting a model

Open a PR adding your orchestrator's adapter config to config/models.yaml and a run config to config/runs/. The next aggregator pass will schedule your run on the self-hosted runner and append the result to the leaderboard. See SPEC.md §10 for the schema.

What this is not

Not a coding benchmark. The orchestrator never writes code.
Not a game benchmark. The model never sees the game.
Not a tool-use benchmark. There are no tools — only structured messages.
Not a chatbot benchmark. There is no human in the loop.