Evaluation Methodology

Evaluation methodology is the discipline of measuring policies correctly — the harnesses, metrics, statistical practices, and protocols that turn a noisy stream of rollouts into a defensible claim about capability. It is distinct from benchmarks: benchmarks fix the tasks; methodology fixes how you measure over them.

From an engineering and deployment standpoint, this is where most reproducibility failures and overclaimed results live. Small evaluation samples, hidden seed selection, mismatched simulator builds, and offline metrics that do not predict online behaviour all combine to produce numbers that look strong on paper and collapse on hardware. Robust methodology is what makes go/no-go decisions, regression checks, and external comparisons trustworthy.

When choosing tools here, prioritise statistical rigour (confidence intervals, seed counts, paired comparisons), sim-real correlation for anything you intend to deploy, and closed-loop evaluation over open-loop replay whenever the task involves control. Treat the evaluation harness as production code: version it, pin it, and require that any reported metric is reproducible from a single command.
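As a concrete illustration of the paired-comparison point, here is a minimal sketch in plain NumPy: the per-seed success rates are hypothetical placeholders, and the bootstrap settings are arbitrary, but the structure — compare matched seeds and report an interval on the difference, not two independent means — is the part that matters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Success rates per evaluation seed; rows are matched (same seed, same
# task set) for both policies. These numbers are illustrative only.
policy_a = np.array([0.62, 0.58, 0.71, 0.65, 0.60, 0.68, 0.63, 0.66])
policy_b = np.array([0.57, 0.55, 0.69, 0.60, 0.59, 0.64, 0.61, 0.62])

diffs = policy_a - policy_b  # paired per-seed differences

# Percentile bootstrap over seeds: resample matched pairs, not policies.
boot = rng.choice(diffs, size=(10_000, diffs.size), replace=True).mean(axis=1)
lo, hi = np.percentile(boot, [2.5, 97.5])

print(f"mean improvement: {diffs.mean():.3f}  95% CI: [{lo:.3f}, {hi:.3f}]")
# Only claim an improvement if the interval excludes zero (and the seed
# count is high enough that the interval is not uselessly wide).
```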

Start here

Statistical Reliability of RL Evaluations (rliable) is the fastest way to upgrade your evaluation hygiene — confidence intervals, performance profiles, and aggregate metrics that hold up under low seed counts.
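A minimal usage sketch of rliable, assuming per-run normalized scores have already been collected into a `(num_runs, num_tasks)` matrix per method; the dictionary contents below are random placeholders, and the exact return shapes should be checked against the rliable documentation.

```python
import numpy as np
from rliable import library as rly
from rliable import metrics

rng = np.random.default_rng(0)

# Normalized scores per method, shaped (num_runs, num_tasks).
# Random placeholders stand in for real evaluation results.
score_dict = {
    "baseline":  rng.uniform(0.3, 0.7, size=(5, 10)),
    "candidate": rng.uniform(0.4, 0.8, size=(5, 10)),
}

# Aggregate with IQM and mean; rliable attaches stratified-bootstrap CIs.
aggregate_func = lambda scores: np.array([
    metrics.aggregate_iqm(scores),
    metrics.aggregate_mean(scores),
])
point_estimates, interval_estimates = rly.get_interval_estimates(
    score_dict, aggregate_func, reps=2000)

for method in score_dict:
    iqm, mean = point_estimates[method]
    print(f"{method}: IQM={iqm:.3f}, mean={mean:.3f}, "
          f"CIs={interval_estimates[method]}")
```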

  • RoboArena — Decentralised real-world evaluation protocol for generalist robot policies.
  • robomimic — Standardised offline-RL and imitation-learning evaluation pipeline with reproducible baselines.
  • Bench2Drive — Closed-loop evaluation protocol for end-to-end driving policies.
  • Eval-vs-Train Mismatch (Kumar et al.) — Methodology paper on why offline metrics mispredict deployed robot performance.
  • SimplerEnv — Aligned simulator-based evaluation that correlates with real-robot performance for VLAs.
  • Statistical Reliability of RL Evaluations (rliable) — Library and methodology for confidence intervals on RL benchmarks.
  • EvalAI — Open platform for challenge hosting, leaderboard management, and standardized evaluation workflows.
  • CodaLab Competitions — Reproducible benchmark and submission platform for shared evaluation protocols.
  • CARLA ScenarioRunner — Scenario-based closed-loop evaluation harness for safety-critical driving behaviors.
  • nuPlan Devkit — End-to-end planning evaluation stack with documented metrics and simulation loops.
  • Waymo Open Challenges — Public challenge suite with fixed protocols for forecasting and planning evaluation.
  • LeRobot Evaluation Scripts — Practical evaluation tooling for imitation-learning and policy-regression checks.
  • RoboHive — Benchmarking suite with standardized tasks and scoring across manipulation and locomotion.
  • Deep RL That Matters — Foundational paper on statistical pitfalls and reproducibility in RL evaluation.
  • Empirical Design in Reinforcement Learning — Guidance on experimental design choices that materially affect reported results.