Antidote: Everyday Edition
Real prompts, real stakes, graded by experts who check every citation, run every line of code, and read for actual usefulness.
Greatness is not accidental, and how we measure it should not be either. If we want AGI that builds enterprises, writes reliable code, and reasons through the world, we need evaluations with real stakes and expert judgment.
This is our ranking of models by rigorous reasoning, domain mastery, and usefulness under pressure.
Multi-step enterprise workflows across customers, tickets, policies, dashboards, and tools. Agents must finish the job, not just sound plausible.
Expert-curated mathematical work that tests proof strategy, definitions, abstraction, and the ability to stay coherent over long derivations.
Reward models judge image edits, visual reasoning, interleaved text-image outputs, and preference pairs with expert agreement.