Irish (Gaeilge) fluency
Eleven prompts test natural Irish syntax, idiom, and task fulfilment. Each axis is a mean score on a 1–5 scale. english_leakage (the Leak column) is scored 5 = stayed in Irish, 1 = fell back to English. NPC dialogue in Rundale code-switches between en-IE and ga-IE, so a model with strong dialogue scores can still fail here.
Judged by judge_gaeilge_v1 (claude-sonnet-4-6). Rubric: the judge knows Standard Irish and the Connacht dialect; it rejects English explanations, Scottish Gaelic, Welsh, and pseudo-Irish word strings; it tolerates dialect variation, missing fadas, and proper names.
| Model | Overall ▼ | Fluency | Grammar | Idiom | Task | Leak | Leak % | n |
|---|---|---|---|---|---|---|---|---|
| deepseek-v4-pro | 5.00 | 5.00 | 5.00 | 5.00 | 5.00 | 5.00 | 0.0% | 10 |
| qwen3.6-plus | 5.00 | 5.00 | 5.00 | 5.00 | 5.00 | 5.00 | 0.0% | 10 |
| deepseek-v4-flash | 4.98 | 5.00 | 5.00 | 4.90 | 5.00 | 5.00 | 0.0% | 10 |
| kimi-k2.5 | 4.90 | 4.90 | 4.90 | 4.90 | 4.90 | 5.00 | 0.0% | 10 |
| kimi-k2.6 | 4.90 | 4.90 | 4.90 | 4.90 | 4.90 | 5.00 | 0.0% | 10 |
| qwen3.5-plus | 4.85 | 4.90 | 4.80 | 4.80 | 4.90 | 5.00 | 0.0% | 10 |
| glm-5 | 4.67 | 4.60 | 4.70 | 4.50 | 4.80 | 5.00 | 0.0% | 10 |
| glm-5.1 | 4.55 | 4.60 | 4.50 | 4.40 | 4.70 | 5.00 | 0.0% | 10 |
| mimo-v2.5 | 4.52 | 4.50 | 4.50 | 4.40 | 4.60 | 5.00 | 0.0% | 10 |
| minimax-m2.5 | 3.73 | 3.70 | 3.70 | 3.80 | 4.00 | 4.60 | 10.0% | 10 |
| minimax-m2.7 | 3.15 | 3.10 | 3.00 | 3.10 | 3.40 | 3.80 | 30.0% | 10 |
| mimo-v2.5-pro | 2.92 | 2.90 | 2.90 | 2.90 | 3.00 | 3.40 | 40.0% | 10 |
| mlx-community/Qwen2.5-14B-Instruct-4bit | 2.11 | 2.09 | 2.27 | 2.09 | 1.91 | 4.82 | 9.1% | 11 |
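
The Leak and Leak % columns can be read together. The sketch below is illustrative only: it assumes Leak % is the share of responses scored below 5 on english_leakage, a definition not stated in this section but consistent with the rows above. The function name `leak_summary` and the per-response score lists are hypothetical, chosen to reproduce two rows of the table; they are not the actual judged outputs.

```python
from statistics import mean

def leak_summary(scores, clean=5):
    """Return (mean leak score, percent of responses scored below `clean`).

    Assumption: a response counts as "leaked" when its english_leakage
    score is below 5. This is inferred from the table, not documented.
    """
    leaked = sum(1 for s in scores if s < clean)
    return round(mean(scores), 2), round(100 * leaked / len(scores), 1)

# Hypothetical per-response scores that reproduce two table rows.
minimax_m2_5 = [5] * 9 + [1]    # 10 responses, one full fallback to English
qwen2_5_14b = [5] * 10 + [3]    # 11 responses, one partial leak

print(leak_summary(minimax_m2_5))  # (4.6, 10.0)
print(leak_summary(qwen2_5_14b))   # (4.82, 9.1)
```

Under this reading, a high Leak mean does not imply good Irish: the Qwen2.5-14B row stays in Irish on almost every response (Leak 4.82) yet scores near 2 on the other axes.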