Datasets (suite v1, dev split)

The frozen rundale-bench prompt set. 6 slices, 270 records on the dev split. The 20% holdout is sealed and never shown — model picks must be defensible without it.

dialogue

159 records

First-person NPC dialogue in the voice of Brigid O'Brien — a 42-year-old midwife in 1820 rural Ireland. Probes period-accurate vocabulary, in-character voice, en-IE / ga-IE code-switching, and refusal of non-Latin script. The headline quality slice — drives the main leaderboard.

reaction

27 records

Short in-character one-liners NPCs emit in response to nearby game events (weather shifts, arrivals, gossip). Tests whether the model can stay in voice under tight token budgets without drifting into narration.

tier2-sim

27 records

Structured world-tick outputs. The model emits JSON describing NPC state updates (mood, goal, current action) given a scene. Tests schema compliance and plausible micro-simulation — the engine runs hundreds of these per game day.

tier3-sim

13 records

Deeper structured sim: multi-step NPC plans, conditional triggers, longer JSON. Same schema-validation + plausibility bar as tier2, but the model must compose several intents coherently.

gaeilge

11 records

Irish-language (Gaeilge) fluency. Eleven prompts in Irish probe natural syntax, idiom, grammar, task-fulfilment, and resistance to falling back to English. Decoupled from the dialogue slice so models that fake en-IE can't fake ga-IE.

intent

33 records

Deterministic player-input parser. Maps natural-language input ("go to the pub", "tell Mary I saw her cow") to {intent: move|talk|look|interact|examine|unknown, target, dialogue}. Exact-match graded — no LLM judge, no axes; the only slice driven entirely by deterministic scoring.