Dialogue quality leaderboard
Judged by claude-sonnet-4-6 · suite v1 · generated 2026-05-26. Scores are means on a 1–5 scale over the dialogue slice (Brigid O'Brien, 1820 rural Ireland). Rows are sorted by Overall, highest first.
| Model | Overall | char | auth | lang | resp | craf | n |
|---|---|---|---|---|---|---|---|
| x-ai/grok-4.3 ⚖ judge_v1 (qwen3-235b) | 5.00 | 5.00 | 5.00 | 5.00 | 5.00 | 5.00 | 15 |
| mimo-v2.5-pro ⚖ judge_sonnet_v1 | 4.58 | 4.50 | 4.40 | 4.90 | 5.00 | 4.10 | 10 |
| qwen3.5-plus ⚖ judge_sonnet_v1 | 4.38 | 4.00 | 4.00 | 5.00 | 4.90 | 4.00 | 10 |
| kimi-k2.5 ⚖ judge_sonnet_v1 | 4.36 | 4.10 | 4.10 | 4.90 | 4.70 | 4.00 | 10 |
| minimax-m2.5 ⚖ judge_sonnet_v1 | 4.32 | 4.30 | 4.40 | 4.40 | 4.90 | 3.80 | 10 |
| mimo-v2.5 ⚖ judge_sonnet_v1 | 4.26 | 4.10 | 4.00 | 4.70 | 4.60 | 3.90 | 10 |
| kimi-k2.6 ⚖ judge_sonnet_v1 | 4.22 | 3.90 | 4.10 | 4.60 | 4.70 | 3.80 | 10 |
| glm-5.1 ⚖ judge_sonnet_v1 | 4.16 | 4.00 | 4.10 | 4.10 | 4.70 | 3.90 | 10 |
| glm-5 ⚖ judge_sonnet_v1 | 4.06 | 4.00 | 4.00 | 4.00 | 4.50 | 3.80 | 10 |
| minimax-m2.7 ⚖ judge_sonnet_v1 | 4.06 | 4.10 | 4.10 | 4.30 | 4.10 | 3.70 | 10 |
| deepseek-v4-flash ⚖ judge_sonnet_v1 | 3.96 | 3.90 | 3.90 | 4.00 | 4.30 | 3.70 | 10 |
| deepseek-v4-pro ⚖ judge_sonnet_v1 | 3.92 | 4.00 | 4.00 | 4.00 | 4.10 | 3.50 | 10 |
| qwen3.6-plus ⚖ judge_sonnet_v1 | 3.86 | 3.80 | 3.80 | 4.00 | 4.20 | 3.50 | 10 |
Quality is judged once per model, on its cheapest provider. Speed and price vary by provider; see the perf matrix.
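For most rows, the Overall column equals the unweighted mean of the five dimension columns. For example, mimo-v2.5-pro gives (4.50 + 4.40 + 4.90 + 5.00 + 4.10) / 5 = 4.58. The leaderboard does not state its aggregation rule, so the sketch below treats that formula as an assumption inferred from the table. It recomputes Overall for three rows copied from the table and sorts them the way the Overall column does. The helper names `overall` and `leaderboard` are hypothetical.

```python
from statistics import mean

# Dimension columns, using the abbreviations from the table header.
DIMENSIONS = ("char", "auth", "lang", "resp", "craf")

# Three rows copied from the table (per-dimension means over n dialogues).
ROWS = {
    "mimo-v2.5-pro": {"char": 4.50, "auth": 4.40, "lang": 4.90, "resp": 5.00, "craf": 4.10},
    "qwen3.5-plus": {"char": 4.00, "auth": 4.00, "lang": 5.00, "resp": 4.90, "craf": 4.00},
    "kimi-k2.5": {"char": 4.10, "auth": 4.10, "lang": 4.90, "resp": 4.70, "craf": 4.00},
}


def overall(scores: dict[str, float]) -> float:
    """Assumed aggregation: unweighted mean of the five dimension means, rounded to 2 dp."""
    return round(mean(scores[d] for d in DIMENSIONS), 2)


def leaderboard(rows: dict[str, dict[str, float]]) -> list[tuple[str, float]]:
    """Return (model, overall) pairs sorted by overall, highest first."""
    ranked = [(model, overall(scores)) for model, scores in rows.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


for model, score in leaderboard(ROWS):
    print(f"{model}: {score:.2f}")
```

Running it prints the three models in the same order as the table, with Overall values of 4.58, 4.38 and 4.36. How the real leaderboard breaks ties between equal Overall scores is not stated. Python's `sorted` is stable, so this sketch keeps tied models in input order.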