EVALUATION
Test behaviour, not just answers.
Benchmarks and rubrics for tool use, autonomy, memory, and multi-agent control effectiveness, plus the proof limits of each.
METHODS
Methods and benchmarks.
METHOD
Red teaming & evaluation
Adversarial testing methods that scale from single-step prompts to multi-step agentic behaviour.
Read →
CATALOGUE
Benchmarks
Curated benchmarks with relevance, maturity, and proof-limit metadata.
Read →
INDEX
Rubrics
The full rubric set used to keep the catalogue evidence-led.
Read →
RUBRICS
Score before you ship.
RUBRIC
Agent security readiness
A scorecard for agent systems pre-launch.
Read →
RUBRIC
Benchmark quality
Tests benchmarks on coverage, realism, and proof limits.
Read →
RUBRIC
Case study
Standardises incident reports for evidence value.
Read →
RUBRIC
Resource quality
Bar for standards, papers, tools, and vendor research.
Read →