ModelBench
Compare LLM performance and value across providers
Best Value
Gemini 1.5 Flash
Value Score: 3367, the best performance per dollar. At $0.075 per 1M input tokens, you get 78.9% on HumanEval.
Best Performance
o1
92.4% HumanEval, 91.8% MMLU. Top scores across benchmarks; ideal for complex reasoning tasks.
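ModelBench does not publish the exact Value Score formula (latency, context limits, and reliability also feed into it, per the notes below), so the following is only a minimal performance-per-dollar sketch. The `blended_price` helper, the 4:1 input:output traffic split, and the equal benchmark weighting are all illustrative assumptions, not the site's method:

```python
def blended_price(input_price: float, output_price: float,
                  output_share: float = 0.25) -> float:
    """Blend per-1M-token input/output prices into one number.
    The 25% output share (i.e. 4:1 input:output traffic) is an
    illustrative assumption, not ModelBench's."""
    return (1 - output_share) * input_price + output_share * output_price

def toy_value_score(humaneval: float, mmlu: float, gsm8k: float,
                    input_price: float, output_price: float) -> float:
    """Mean benchmark score divided by blended price. The real Value
    Score also weighs latency, context limits, and reliability, so
    this will not reproduce the table's numbers."""
    mean_score = (humaneval + mmlu + gsm8k) / 3
    return mean_score / blended_price(input_price, output_price)

# Gemini 1.5 Flash row: 78.9 / 78.9 / 86.2 at $0.075 in, $0.30 out
print(round(toy_value_score(78.9, 78.9, 86.2, 0.075, 0.30)))  # ~620, not 3367
```

Because the published score also weighs non-benchmark factors, this toy number lands well below the 3367 in the table; treat it as intuition for why cheap, decent models dominate the top of the ranking.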
PRAQTOR Edge
- We factor in latency, context limits, and reliability, not just benchmark scores.
- Compare apples to apples across different pricing structures (see the cost sketch after this list).
- Coding vs. summarization vs. chat: pick the right model for your task.
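One way to compare models with asymmetric input/output pricing on an equal footing is to cost out a representative request. A minimal sketch: the `PRICES` entries transcribe three rows from the table below, and the 3,000-in / 500-out token counts are an arbitrary example workload:

```python
# (input $/1M, output $/1M), transcribed from the table below
PRICES = {
    "gemini-1.5-flash": (0.075, 0.30),
    "claude-3.5-haiku": (0.80, 4.00),
    "llama-3.1-70b":    (0.88, 0.88),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single request under per-1M-token pricing."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Example workload: 3,000 tokens in, 500 tokens out
for name in PRICES:
    print(f"{name}: ${request_cost(name, 3_000, 500):.6f}")
```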
| Model | Provider | HumanEval | MMLU | GSM8K | Price (in / out, $/1M tokens) | Value Score |
|---|---|---|---|---|---|---|
| Gemini 1.5 Flash (Best Value) 👁️ Vision ⚡ Functions | Google | 78.9% | 78.9% | 86.2% | $0.075 / $0.30 | 3367 |
| Llama 3.1 8B ⚡ Functions | Meta (Together) | 72.6% | 73.0% | 84.5% | $0.18 / $0.18 | 3307 |
| Gemini 2.0 Flash 👁️ Vision ⚡ Functions | Google | 89.7% | 83.2% | 93.1% | $0.10 / $0.40 | 2749 |
| GPT-4o Mini 👁️ Vision ⚡ Functions | OpenAI | 87.0% | 82.0% | 93.2% | $0.15 / $0.60 | 1806 |
| Mixtral 8x7B ⚡ Functions | Mistral (Together) | 74.0% | 70.6% | 74.4% | $0.60 / $0.60 | 947 |
| Llama 3.1 70B ⚡ Functions | Meta (Together) | 80.5% | 86.0% | 94.1% | $0.88 / $0.88 | 765 |
| Claude 3.5 Haiku 👁️ Vision ⚡ Functions | Anthropic | 88.1% | 75.2% | 91.6% | $0.80 / $4.00 | 274 |
| Gemini 1.5 Pro 👁️ Vision ⚡ Functions | Google | 84.1% | 85.9% | 91.7% | $1.25 / $5.00 | 217 |
| Llama 3.1 405B ⚡ Functions | Meta (Together) | 89.0% | 88.6% | 96.8% | $3.50 / $3.50 | 203 |
| Mistral Large ⚡ Functions | Mistral | 84.0% | 84.0% | 91.2% | $2.00 / $6.00 | 167 |
| o1 Mini | OpenAI | 92.4% | 85.2% | 94.8% | $3.00 / $12.00 | 121 |
| Claude 3.5 Sonnet 👁️ Vision ⚡ Functions | Anthropic | 92.0% | 88.3% | 96.4% | $3.00 / $15.00 | 79 |
| GPT-4o 👁️ Vision ⚡ Functions | OpenAI | 90.2% | 88.7% | 95.3% | $5.00 / $15.00 | 71 |
| o1 (Top Perf) 👁️ Vision | OpenAI | 92.4% | 91.8% | 96.4% | $15.00 / $60.00 | 25 |
| Claude 3 Opus 👁️ Vision ⚡ Functions | Anthropic | 84.9% | 86.8% | 95.0% | $15.00 / $75.00 | 15 |
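With the table in hand, a common selection pattern is the cheapest model that clears a quality bar. A minimal sketch: the `MODELS` list transcribes a few rows from the table above, and ranking by input price alone is a simplifying assumption (output price and non-benchmark factors matter too):

```python
# (name, HumanEval %, input $/1M, output $/1M), a few rows from the table
MODELS = [
    ("Gemini 1.5 Flash",  78.9,  0.075,  0.30),
    ("GPT-4o Mini",       87.0,  0.15,   0.60),
    ("Claude 3.5 Sonnet", 92.0,  3.00,  15.00),
    ("o1",                92.4, 15.00,  60.00),
]

def cheapest_above(min_humaneval: float):
    """Cheapest model (ranked by input price, a simplification)
    that clears the given HumanEval threshold."""
    candidates = [m for m in MODELS if m[1] >= min_humaneval]
    return min(candidates, key=lambda m: m[2], default=None)

print(cheapest_above(85.0))   # -> the GPT-4o Mini row
print(cheapest_above(99.0))   # -> None: nothing clears the bar
```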
PRAQTOR Insight
Gemini 1.5 Flash offers the best value: 78.9% HumanEval at just $0.075 per 1M input tokens. For maximum performance, o1 leads with 92.4% HumanEval.

💡 Tip: Use tiered routing. Route simple queries (often ~80% of traffic) to Gemini 1.5 Flash and send complex tasks to o1 for the best cost-performance balance.
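A minimal sketch of the tiered-routing tip above; the `is_complex` heuristic, its keyword list, and the length cutoff are illustrative placeholders (production routers typically use a trained classifier), and dispatching to each model would go through the provider's own SDK:

```python
CHEAP_MODEL  = "gemini-1.5-flash"  # top Value Score in the table
STRONG_MODEL = "o1"                # top benchmark scores in the table

def is_complex(prompt: str) -> bool:
    """Toy complexity heuristic; the keywords and length cutoff are
    placeholders for a real classifier or difficulty estimator."""
    hard_markers = ("prove", "refactor", "multi-step", "debug")
    return len(prompt) > 2_000 or any(m in prompt.lower() for m in hard_markers)

def route(prompt: str) -> str:
    """Send simple queries to the cheap tier, hard ones to the strong tier."""
    return STRONG_MODEL if is_complex(prompt) else CHEAP_MODEL

print(route("Summarize this paragraph in two sentences."))           # gemini-1.5-flash
print(route("Debug this deadlock in my thread pool, step by step."))  # o1
```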
How We Use AI
Benchmark aggregation and Value Score calculation are automated. Raw benchmark scores are never modified; we only add context and recommendations.
Data sources: Papers With Code, OpenAI, Anthropic, Google, Meta, Mistral official documentation. Updated December 2024. Pricing in USD per 1M tokens.