Riemann-Bench: Evaluating Moonshot Mathematical Reasoning
A private set of research-level problems designed to evaluate long-form theorem work beyond contest-style shortcuts.
We show how dense professional simulations can improve tool use, planning, and transfer to unseen workplace tasks.
An empirical framework for measuring whether agents can plan, adapt, stay grounded, and recover from realistic ambiguity.
A benchmark for reward models that must judge image editing, interleaved generation, and multimodal reasoning together.
A study of how expert disagreement changes when rubrics ask evaluators to weigh correctness, usefulness, and judgment.
A controlled evaluation of whether models preserve facts when prompts include misleading dates, names, and citations.