Kigex AI

Riemann-Bench: Evaluating Moonshot Mathematical Reasoning

A private set of research-level problems designed to evaluate long-form theorem work beyond contest-style shortcuts.

Kigex AI

Corecraft: Training Generalizable Agents in High-Fidelity Worlds

We show how dense professional simulations can improve tool use, planning, and transfer to unseen workplace tasks.

Kigex AI

The Hierarchy of Agentic Capabilities

An empirical framework for measuring whether agents can plan, adapt, stay grounded, and recover from realistic ambiguity.

Meta x Kigex AI

Multimodal Reward Models for Interleaved Text and Image

A benchmark for reward models that must judge image editing, interleaved generation, and multimodal reasoning together.

Kigex AI x Meridian Lab

Rubric Reliability for Expert Human Preference Data

A study of how expert disagreement changes when rubrics ask evaluators to weigh correctness, usefulness, and judgment.

Kigex AI

Measuring Factual Memory Under Adversarial Context

A controlled evaluation of whether models preserve facts when prompts include misleading dates, names, and citations.