Leaderboards

Greatness is not accidental. How we measure it should not be either. If we want AGI that builds enterprises, writes reliable code, and reasons through the world, we need evaluations with real stakes and expert judgment.

This is our ranking of models by rigorous reasoning, domain mastery, and usefulness under pressure.

Antidote / Everyday

Antidote: Everyday Edition

Real prompts, real stakes, graded by experts who check every citation, run every line of code, and read for actual usefulness.

Rank  Model              Elo score (95% CI)
1     Orion 3.1 Pro      1099 (1087-1110)
2     Helios 3.5 Flash   1082 (1071-1095)
3     Atlas Opus 4.6     1046 (1036-1058)
4     Nova Reasoner Max  1045 (1032-1057)
5     Kite 5.5           1022 (1011-1034)
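
One way to read the Elo gaps and confidence intervals in this table is sketched below. It assumes the scores follow the conventional Elo logistic curve with a 400-point scale, which this page does not state, and the helper names are hypothetical. Overlapping intervals are treated only as a rough hint that two models are hard to separate, not as a formal significance test.

```python
# Minimal sketch. ASSUMPTION: the leaderboard uses the conventional Elo
# logistic curve with a 400-point scale; the page itself does not say so.

def expected_win_probability(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


def intervals_overlap(ci_a: tuple[float, float], ci_b: tuple[float, float]) -> bool:
    """True when two closed intervals (low, high) share at least one point."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]


# Scores and 95% intervals copied from the table above.
leaderboard = {
    "Orion 3.1 Pro": (1099, (1087, 1110)),
    "Helios 3.5 Flash": (1082, (1071, 1095)),
    "Atlas Opus 4.6": (1046, (1036, 1058)),
    "Nova Reasoner Max": (1045, (1032, 1057)),
    "Kite 5.5": (1022, (1011, 1034)),
}

if __name__ == "__main__":
    names = list(leaderboard)
    # Compare each model with the one ranked directly below it.
    for upper, lower in zip(names, names[1:]):
        score_u, ci_u = leaderboard[upper]
        score_l, ci_l = leaderboard[lower]
        p = expected_win_probability(score_u, score_l)
        note = "intervals overlap" if intervals_overlap(ci_u, ci_l) else "intervals are disjoint"
        print(f"{upper} vs {lower}: expected win rate {p:.3f}, {note}")
```

Under these assumptions, a one-point gap such as Atlas Opus 4.6 over Nova Reasoner Max corresponds to a win rate barely above 50%, and the two intervals overlap almost entirely, so the ordering of those two rows deserves little weight.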

Corecraft / Enterprise

EnterpriseBench Corecraft

Multi-step enterprise workflows across customers, tickets, policies, dashboards, and tools. Agents must finish the job, not just sound plausible.

Rank  Model             Pass rate
1     Atlas Opus 4.6    36.8%
2     Orion 3.1 Pro     34.2%
3     Kite 5.5          31.6%
4     Helios 3.5 Flash  29.4%
5     Vector Mini       23.9%

Research / Mathematics

Riemann-Bench

Expert-curated mathematical work that tests proof strategy, definitions, abstraction, and the ability to stay coherent over long derivations.

Rank  Model           Score
1     Orion Research  74.1
2     Atlas Opus 4.6  71.8
3     Kite 5.5        66.4
4     Helios Flash    58.7

Multimodal / Reward

Multimodal RewardBench

Reward models judge image edits, visual reasoning, interleaved text-image outputs, and preference pairs; accuracy is measured as agreement with expert judgments.

Rank  Judge           Accuracy
1     Human Experts   91.4%
2     Orion 3.1 Pro   79.5%
3     Atlas Opus 4.6  75.2%
4     OpenJudge 32B   64.7%

Human Judgment Score

Rank  Model           Score
1     Orion 3.1 Pro   91.8
2     Atlas Opus 4.6  88.9
3     Helios Flash    86.2
4     Kite 5.5        83.4
5     Vector Mini     76.1