Benchmark quality rubric
A rubric for assessing the quality of agentic AI security benchmarks.
Criteria
- Scope and relevance to agentic systems
- Methodological soundness
- Evidence requirements
- Transparency of limitations
- Reproducibility
Anchored level descriptors
Each criterion is scored 0–3 against the anchors below.
Scope and relevance to agentic systems
- 0 — Absent. Tests only single-turn model responses; no notion of tools, memory, multi-step planning, or delegated authority.
- 1 — Inadequate. Touches one agentic surface superficially (e.g., generic prompt-injection on chat without tool use).
- 2 — Adequate. Covers at least one agentic chain end-to-end (input → reasoning → tool call → outcome) with realistic tool surfaces.
- 3 — Strong. Covers multiple chains from the agentic threat taxonomy (tool misuse, credential overreach, memory poisoning, multi-agent contamination, code-execution side effects, etc.) with realistic adversarial scenarios and clear mapping to attack surfaces.
Methodological soundness
- 0 — Absent. No method described; results cannot be interpreted.
- 1 — Inadequate. Method described but key choices (task selection, scoring, baselines) are unjustified.
- 2 — Adequate. Method is documented with justified choices, baselines, and acknowledged limitations. Independent re-running would yield comparable results.
- 3 — Strong. Pre-registered or peer-reviewed method with explicit threats-to-validity, ablations, and statistical treatment of variance. Replication has been demonstrated externally.
Evidence requirements
- 0 — Absent. Reports headline numbers only; no per-task results, prompts, or traces.
- 1 — Inadequate. Aggregate results published; underlying prompts, tools, or per-task scores are not shareable.
- 2 — Adequate. Per-task results, prompts, and tool definitions are published; traces or transcripts are available for inspection.
- 3 — Strong. All artefacts are versioned and reproducible: prompts, tool implementations, per-task results, and full traces. Results across model/agent versions are tracked over time.
Transparency of limitations
- 0 — Absent. No discussion of limitations; results presented as definitive.
- 1 — Inadequate. Boilerplate caveat only (“results may not generalise”).
- 2 — Adequate. Specific known limitations are listed (coverage gaps, language scope, threat-model assumptions, contamination risk).
- 3 — Strong. Limitations are quantified where possible, threat-model boundaries are explicit, and the benchmark documents what it does not claim to measure.
Reproducibility
- 0 — Absent. Cannot be reproduced from public artefacts.
- 1 — Inadequate. Partial artefacts; dependency or runtime details missing; non-deterministic without seeds.
- 2 — Adequate. Public, runnable artefact with pinned dependencies; seeded runs produce comparable results within a documented variance.
- 3 — Strong. One-command reproduction with containerised runtime, pinned dependencies, recorded seeds, and documented variance bounds. CI runs the benchmark on each release.
Scoring procedure
- Raters. Single rater for routine catalogue review; two raters with adjudication for any benchmark a published recommendation will rest on.
- Evidence. Cite a section, table, or repository path for each score. For per-task evidence, link to the artefact location (e.g., the GitHub directory containing per-task transcripts).
- Recency. Re-score when the benchmark publishes a new version or when the agentic threat landscape it claims to cover materially shifts.
- Recording. File the completed scoresheet under scoresheets/ using the naming convention benchmark-{benchmark-slug}.md.
Aggregation rule
- Maximum raw score: 15 (5 criteria × 3).
- Floor rule: any criterion at 0 fails the rubric outright; two or more criteria at ≤ 1 also fail.
- Verdict bands:
- ≥ 12 with no criterion below 2 → Recommended; cite confidently.
- 9–11 with no criterion below 2 → Cite with caveats noted alongside the entry.
- < 9, or any criterion at 0–1 → Do not cite as authoritative; treat as exploratory only.
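The floor rule and verdict bands above can be expressed as one small check. A minimal sketch; the function name and return strings are illustrative, not part of the rubric:

```python
def verdict(scores):
    """Map five criterion scores (0-3 each) to a verdict band.

    `scores` may come in any order; the band strings are shorthand
    for the rubric's three verdicts.
    """
    if len(scores) != 5 or any(s not in (0, 1, 2, 3) for s in scores):
        raise ValueError("expected five integer scores in 0-3")
    total = sum(scores)  # maximum raw score: 15
    # Floor rule: any criterion at 0 fails outright; two or more at <= 1
    # also fail. Both cases are subsumed by the verdict bands' stricter
    # "any criterion at 0-1" condition, checked here.
    if any(s <= 1 for s in scores) or total < 9:
        return "Do not cite as authoritative"
    return "Recommended" if total >= 12 else "Cite with caveats"
```

Note that the bands make the floor rule's second clause redundant: a single criterion at 1 already drops the benchmark out of both citable bands.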
Scoresheet template
---
rubric: benchmark-quality-rubric.md
artefact: <benchmark name>
artefact_version: <release tag or commit SHA>
artefact_url: <link to benchmark repository or paper>
scored_by: <name or handle>
scored_on: <YYYY-MM-DD>
rater_count: <1 or 2-with-adjudication>
---
## Scores
| Criterion | Score (0–3) | Evidence | Notes |
|---|---|---|---|
| Scope and relevance to agentic systems | | | |
| Methodological soundness | | | |
| Evidence requirements | | | |
| Transparency of limitations | | | |
| Reproducibility | | | |
## Aggregate
- Raw total: __ / 15
- Floor rule triggered: <yes/no, which criterion>
- Verdict: <Recommended / Cite with caveats / Do not cite>
## Reviewer commentary
<2–4 sentences on what's strongest, what's weakest, and what would lift the score.>
Inter-rater agreement
Pilot pending. Target Cohen’s κ ≥ 0.6 on a 12-benchmark pilot before this rubric is treated as load-bearing for catalogue inclusion. Until then, scores are guidance and disagreements should be resolved through discussion.
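The κ ≥ 0.6 target can be checked with a short Cohen's kappa computation over the two raters' verdict bands. An illustrative sketch; the function name and sample data are hypothetical:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items.

    Items could be the verdict bands each rater assigns to the pilot
    benchmarks. Returns a value in [-1, 1]; 1.0 is perfect agreement.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(rater_a)
    # Observed agreement: fraction of items where the raters match.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement under independent raters with these marginals.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters constant and identical
    return (observed - expected) / (1 - expected)
```

On a pilot where the raters agree on most but not all verdicts, this gives a single number to compare against the 0.6 threshold before the rubric is treated as load-bearing.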