EVALUATION

Test behaviour, not just answers.

Benchmarks and rubrics for tool use, autonomy, memory, and multi-agent control effectiveness, along with the proof limits of each.

METHODS

Methods and benchmarks.

Adversarial testing methods that scale from single-step prompts to multi-step agentic behaviour.

Curated benchmarks with relevance, maturity, and proof-limit metadata.

The full rubric set used to keep the catalogue evidence-led.

RUBRICS

A scorecard for agent systems pre-launch.

Tests benchmarks on coverage, realism, and proof limits.

Standardises incident reports for evidence value.

Sets the bar for standards, papers, tools, and vendor research.