All models
Every model the bench has seen, with whichever signals are available. A "—" in a cell means that column has not been measured yet for that model. Click a column header to sort; click a row to drill into per-prompt samples and per-provider performance.
| Model | Type | Dialog | Gaeilge | p50 (ms) | tok/s | $/hr | Providers |
|---|---|---|---|---|---|---|---|
| x-ai/grok-4.3 | cloud | 5.00 | — | 3183 | 207.5 | — | 1 |
| mimo-v2.5-pro | cloud | 4.58 | 2.92 | 7714 | 354.6 | $0.87 | 1 |
| qwen3.5-plus | cloud | 4.38 | 4.85 | 81710 | 2081.4 | $0.20 | 1 |
| kimi-k2.5 | cloud | 4.36 | 4.90 | 6759 | — | $0.57 | 1 |
| minimax-m2.5 | cloud | 4.32 | 3.73 | 5504 | 43.3 | $0.27 | 1 |
| mimo-v2.5 | cloud | 4.26 | 4.52 | 4196 | 469.6 | $0.38 | 1 |
| kimi-k2.6 | cloud | 4.22 | 4.90 | 7269 | — | $0.87 | 1 |
| glm-5.1 | cloud | 4.16 | 4.55 | 7571 | 279.7 | $1.23 | 1 |
| glm-5 | cloud | 4.06 | 4.67 | 7848 | 167.4 | $0.88 | 1 |
| minimax-m2.7 | cloud | 4.06 | 3.15 | 2781 | 4975.4 | $0.27 | 1 |
| deepseek-v4-flash | cloud | 3.96 | 4.98 | 3799 | — | $0.12 | 1 |
| deepseek-v4-pro | cloud | 3.92 | 5.00 | 6335 | — | $1.44 | 1 |
| qwen3.6-plus | cloud | 3.86 | 5.00 | 55809 | 961.3 | $0.50 | 1 |
| mlx-community/Qwen2.5-14B-Instruct-4bit | local | — | 2.11 | — | — | — | — |
| amazon/nova-pro-v1 | cloud | — | — | 855 | 65.9 | — | 1 |
| anthropic/claude-haiku-4.5 | cloud | — | — | 1989 | 81.6 | — | 1 |
| anthropic/claude-opus-4.7 | cloud | — | — | 3861 | 39.0 | — | 1 |
| anthropic/claude-sonnet-4.6 | cloud | — | — | 3221 | 44.1 | — | 1 |
| deepseek/deepseek-v3.2 | cloud | — | — | 2704 | 22.2 | — | 1 |
| deepseek/deepseek-v4-pro | cloud | — | — | 6170 | 39.2 | — | 1 |
| google/gemini-2.5-flash | cloud | — | — | 1032 | 99.5 | — | 1 |
| google/gemini-2.5-pro | cloud | — | — | 3331 | 143.3 | — | 1 |
| google/gemma-3-27b-it | cloud | — | — | 2328 | 37.9 | — | 1 |
| google/gemma-4-31b-it | cloud | — | — | 3214 | 21.9 | — | 1 |
| meta-llama/llama-3.3-70b-instruct | cloud | — | — | 2339 | 43.2 | $0.11 | 1 |
| meta-llama/llama-4-maverick | cloud | — | — | 1601 | 51.1 | — | 1 |
| meta-llama/llama-4-scout | cloud | — | — | 1419 | 54.2 | — | 1 |
| microsoft/phi-4 | cloud | — | — | 1674 | 67.8 | — | 1 |
| mistralai/mistral-large-2512 | cloud | — | — | 1714 | 44.1 | — | 1 |
| mistralai/mistral-medium-3.1 | cloud | — | — | 2089 | 50.4 | — | 1 |
| mistralai/mistral-small-24b-instruct-2501 | cloud | — | — | 838 | 77.4 | — | 1 |
| moonshotai/kimi-k2.5 | cloud | — | — | 7220 | 37.6 | — | 1 |
| nousresearch/hermes-4-405b | cloud | — | — | 5623 | 39.4 | — | 1 |
| openai/gpt-4o-mini | cloud | — | — | 1243 | 192.7 | — | 1 |
| openai/gpt-5.4 | cloud | — | — | 2677 | 34.7 | — | 1 |
| openai/gpt-5.4-mini | cloud | — | — | 1389 | 75.5 | — | 1 |
| openai/gpt-5.5 | cloud | — | — | 4067 | 60.0 | — | 1 |
| openai/gpt-oss-120b | cloud | — | — | 3072 | 62.7 | — | 1 |
| qwen/qwen-2.5-72b-instruct | cloud | — | — | 5242 | — | — | 1 |
| qwen/qwen3-235b-a22b-2507 | cloud | — | — | 1696 | 54.2 | — | 1 |
| qwen/qwen3-max | cloud | — | — | 2904 | 32.6 | — | 1 |
| x-ai/grok-3-mini | cloud | — | — | — | — | — | 1 |
| z-ai/glm-4.6 | cloud | — | — | 16883 | 34.7 | — | 1 |
Rows with quality scores are ranked by overall dialogue score by default. A model with neither a dialogue nor a Gaeilge score has only performance data so far; run the funnel against it to fill in the quality columns.
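The default ordering can be sketched roughly as follows. This is a minimal illustration, not the bench's actual code: the names `parse_cell`, `rank_rows`, and `perf_only` are hypothetical, and the tie-break (rows with any quality score first, then model name) is an assumption chosen to be consistent with the order seen in the table.

```python
from typing import Optional

MISSING = "—"  # the table's marker for "not yet measured"


def parse_cell(cell: str) -> Optional[float]:
    """Turn a table cell into a number, or None when it is not yet measured."""
    text = cell.strip().lstrip("$")
    return None if text == MISSING else float(text)


def perf_only(row: dict) -> bool:
    """True when a model has neither a dialogue nor a Gaeilge score."""
    return row["dialog"] is None and row["gaeilge"] is None


def rank_rows(rows: list) -> list:
    """Sort by dialogue score, highest first, with unmeasured rows last.

    Assumed tie-break: rows that have some quality score come before
    perf-only rows, then rows are ordered by model name.
    """
    def key(row: dict):
        score = row["dialog"]
        return (score is None, -(score or 0.0), perf_only(row), row["model"])

    return sorted(rows, key=key)


# A few rows taken from the table above.
rows = [
    {"model": "amazon/nova-pro-v1", "dialog": parse_cell("—"), "gaeilge": parse_cell("—")},
    {"model": "kimi-k2.5", "dialog": parse_cell("4.36"), "gaeilge": parse_cell("4.90")},
    {"model": "x-ai/grok-4.3", "dialog": parse_cell("5.00"), "gaeilge": parse_cell("—")},
    {"model": "mlx-community/Qwen2.5-14B-Instruct-4bit", "dialog": parse_cell("—"), "gaeilge": parse_cell("2.11")},
]

ranked = rank_rows(rows)
for row in ranked:
    label = "perf only, run the funnel" if perf_only(row) else "has quality data"
    print(f'{row["model"]}: {label}')
```

Treating an unmeasured cell as `None` rather than `0` keeps unmeasured models from being ranked as if they had scored zero.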