Compare many AI models on one task—with your exact settings

Run custom prompts and parameters across dozens of models. See quality, cost, and speed side‑by‑side—then share a reproducible permalink.

Model Comparison
Blind: ON
Judges: 3
Model A
To fix the authentication bug, check if the JWT token expiration is properly validated in the middleware. Add a try-catch block around the token verification and ensure the refresh token logic handles edge cases when the access token expires during an active session.
Quality 9.2 · 1.2k tokens · €0.012 · 142ms
Model B
The issue appears to be in your auth middleware. The JWT verification isn't handling token expiration correctly. I'd recommend: 1) Add proper error handling for expired tokens, 2) Implement token refresh logic before the token expires, 3) Clear invalid tokens from storage to prevent retry loops.
Quality 8.9 · 1.1k tokens · €0.008 · 156ms
Model C
Your authentication middleware needs to handle JWT expiration. Wrap the token decode in try-catch, check the exp claim, and implement a token refresh mechanism. Also verify that your auth headers are being passed correctly in all API requests.
Quality 8.7 · 980 tokens · €0.006 · 134ms

Decisions at a glance

We don't just show outputs—we surface the signals that decide the choice.


Who actually wins

A vs B: 64% win rate
Agreement: 0.78
View pairwise matrix

Pay for results, not tokens

$ per correct: 0.013
See cost breakdown
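As a worked example with illustrative numbers: a run that spends $1.30 in total and yields 100 outputs the judges mark correct comes out to $1.30 / 100 = $0.013 per correct answer, the figure shown above.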

Fast when it matters

p50 134ms
p95 280ms
Latency details
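For readers unfamiliar with the notation: p50 and p95 are the 50th and 95th percentiles of per-request latency. A minimal sketch of one common estimator (nearest-rank; the platform's actual method is not specified here):

// Nearest-rank percentile over raw per-request timings (illustrative values).
function percentile(samplesMs: number[], p: number): number {
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}

const latencies = [120, 128, 131, 134, 142, 156, 280];
percentile(latencies, 50); // 134, the median (p50)
percentile(latencies, 95); // 280, the tail (p95)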

Fewer surprises

Schema fails: 0.9%
Variance: low
Judge rationales
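The schema-failure rate is the share of model outputs that fail to validate against the expected JSON Schema. A minimal sketch of how it can be measured, using the Ajv validator; the schema below is illustrative, not the platform's:

import Ajv from "ajv";

const ajv = new Ajv();
// Illustrative schema for a structured tool-call output.
const validate = ajv.compile({
  type: "object",
  required: ["name", "args"],
  properties: { name: { type: "string" }, args: { type: "object" } },
});

function schemaFailRate(outputs: unknown[]): number {
  const fails = outputs.filter((o) => !validate(o)).length;
  return fails / outputs.length; // e.g. 0.009 for the 0.9% shown above
}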

Use your prompt—or grab a proven one

Start with presets, paste your own, or pick from community prompt packs.

Code Fix

Debug failing tests and generate patches

Python
JavaScript
Tests
Fix the bug in this code that causes the authentication test to fail...
Use this preset →

Q&A with sources

Answer questions using provided documents

RAG
Citations
Factual
Answer the following question using only the provided documents and cite your sources...
Use this preset →

API / tool use

Generate valid API calls and function invocations

JSON
Schema
Functions
Generate a valid API call to accomplish this task using the provided schema...
Use this preset →

Judges, not vibes.

Multiple judges with visible rationales and agreement. Turn them off anytime.

Judge Agreement

0.78
agreement
Low agreement · High agreement

Judge Rationales

Judge A · GPT-4 · 9.1 · Picked: Model B
Judge B · Claude-3 · 9.3 · Picked: Model B
Judge C · Gemini Pro · 8.9 · Picked: Model A

How judge scoring works

Each judge evaluates responses independently using criteria like correctness, clarity, and efficiency. The ensemble aggregates their scores with visible rationales and agreement metrics.

See how scoring works
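For the curious, a minimal sketch of one way such an ensemble can be aggregated. The shapes and the pairwise-pick agreement metric below are assumptions for illustration; the platform's actual criteria, weighting, and agreement formula are not specified here.

type Verdict = { judge: string; score: number; picked: string };

function aggregate(verdicts: Verdict[]) {
  // Mean score across judges.
  const mean = verdicts.reduce((s, v) => s + v.score, 0) / verdicts.length;
  // Agreement: fraction of judge pairs that picked the same model
  // (one possible metric; chance-corrected statistics also exist).
  let same = 0, pairs = 0;
  for (let i = 0; i < verdicts.length; i++) {
    for (let j = i + 1; j < verdicts.length; j++) {
      pairs++;
      if (verdicts[i].picked === verdicts[j].picked) same++;
    }
  }
  return { mean, agreement: pairs ? same / pairs : 1 };
}

aggregate([
  { judge: "GPT-4", score: 9.1, picked: "Model B" },
  { judge: "Claude-3", score: 9.3, picked: "Model B" },
  { judge: "Gemini Pro", score: 8.9, picked: "Model A" },
]); // mean 9.1; agreement 1/3 under this particular metric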

Share one report you can audit

Each run becomes a report with side-by-side outputs, scores, cost & speed, judge rationales, and pinned versions—share, embed, or export.

Shareable permalink
compare-hub.com/run/ab12c?dataset=v1.2&prompt=v3.1
Dataset v1.2
Prompt v3.1
Run #ab12c

Full comparison report

Side-by-side model outputs, scores, cost & latency breakdowns, and judge rationales—everything you need to make informed decisions.

Reproducible & shareable

Every link pins dataset, prompt, and model versions. Share with colleagues, embed in docs, or re-run months later with identical conditions.
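A minimal sketch of what such a permalink encodes, reusing the example link above; the parsing here is illustrative, not the product's API:

const url = new URL("https://compare-hub.com/run/ab12c?dataset=v1.2&prompt=v3.1");
const runId = url.pathname.split("/").pop();     // "ab12c"
const dataset = url.searchParams.get("dataset"); // "v1.2"
const prompt = url.searchParams.get("prompt");   // "v3.1"
// Re-running with these pinned versions reproduces the original conditions.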

Use it as...

Choose your role to see how our platform fits your workflow

Individuals

Find your daily model for chat, drafts, and coding help—without guessing.

Blind by default
One-click presets
Share link
Avg cost/run ≈ $0.02–$0.10
Popular five: GPT-4o mini, Claude Haiku, Mistral 7B, Qwen small, Llama-3.1-8B (configurable).

How It Works

Streamlined process for professional model benchmarking

Compose

Pick a task or paste your prompt. Templates with variables and token counting (sketched after these steps).

Pick models

Compare many models at once. Model names are hidden by default until you reveal them.

Judge & share

Judges evaluate by criteria. Get scores, cost & speed in a shareable permalink.
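To make the Compose step concrete, here is a minimal sketch of template-variable substitution with a rough token estimate. The {{name}} placeholder syntax and the characters-per-token heuristic are assumptions, not the platform's actual implementation.

function renderTemplate(tpl: string, vars: Record<string, string>): string {
  // Replace {{key}} placeholders; leave unknown keys intact.
  return tpl.replace(/\{\{(\w+)\}\}/g, (m, key) => vars[key] ?? m);
}

const prompt = renderTemplate(
  "Fix the bug in this {{language}} code that causes the {{test}} test to fail.",
  { language: "Python", test: "authentication" }
);
const approxTokens = Math.ceil(prompt.length / 4); // crude; real counts are tokenizer-specific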

FAQ

Do you store my prompts?

By default—yes, to ensure reproducibility. Mark Private to store only metrics.

Learn more →

Can I use my own keys?

Yes—OpenRouter or direct provider keys.

Learn more →

How fair are the scores?

Multiple judges with visible agreement. You can disable judges anytime.

Learn more →

What's the max models per run?

Up to N (current cap). Names are hidden until you reveal.

Learn more →

Can I embed results?

Yes—share a permalink or use an iframe embed.

Learn more →
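A minimal sketch of an embed, reusing the example run above; the attributes and endpoint are illustrative and may differ from the actual embed code:

const embed = `<iframe src="https://compare-hub.com/run/ab12c"
  width="100%" height="480" loading="lazy"
  title="Model comparison run ab12c"></iframe>`;
document.body.insertAdjacentHTML("beforeend", embed);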

Can I export the numbers?

CSV/JSON export for scores, cost, and latency.

Learn more →
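A minimal sketch of the kind of record a JSON export could contain, inferred from the metrics shown on this page; field names are assumptions, not the actual export schema.

interface ExportRow {
  model: string;     // "Model A", or the real name after reveal
  score: number;     // ensemble judge score, e.g. 9.2
  tokens: number;    // e.g. 1200
  costEur: number;   // e.g. 0.012
  latencyMs: number; // e.g. 142
}

declare const exportedJson: string; // contents of a downloaded export file
const rows: ExportRow[] = JSON.parse(exportedJson);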

Join the waitlist — get trial credits & launch pricing

Free credits for early runs. Insider updates. Early-bird discounts.

Trial credits
Insider updates
Launch discounts

You'll receive credits + updates. Unsubscribe anytime. See Terms · Privacy.