
Resource quality rubric

A rubric for assessing the quality of resources included in the agentic AI security field guide.

Criteria

  • Relevance to agentic AI security
  • Evidence-based claims
  • Clarity and editorial quality
  • Recency and ongoing relevance
  • Transparency of limitations

Anchored level descriptors

Each criterion is scored 0–3 against the anchors below. The rubric applies to both external resources (standards, papers, tools) and internal field-guide resources (docs pages, chain stubs, patterns).

Relevance to agentic AI security

  • 0 — Absent. Generic AI or general security material with no agentic relevance.
  • 1 — Inadequate. Touches one aspect tangentially (e.g., LLM safety in general) without addressing tools, memory, credentials, multi-agent flow, or delegated authority.
  • 2 — Adequate. Directly addresses one or more agentic surfaces or chains; useful to a reader of this field guide.
  • 3 — Strong. Centrally about agentic execution security; addresses multiple surfaces or chains and earns a place in a reading path for the topic.

Evidence-based claims

  • 0 — Absent. Claims are unsupported, speculative, or promotional.
  • 1 — Inadequate. Some claims are supported, others are assertions without evidence; the reader cannot tell which is which.
  • 2 — Adequate. Load-bearing claims are sourced (citation, dataset, code, or named example). The distinction between empirical claims and informed opinion is clear.
  • 3 — Strong. Every load-bearing claim is referenced or demonstrated; methodology behind quantitative claims is explicit; opinions are flagged as such.

Clarity and editorial quality

  • 0 — Absent. Unstructured, jargon-heavy, or unreadable; key terms undefined.
  • 1 — Inadequate. Readable but disorganised; key concepts buried; tone inconsistent with the field guide’s editorial standard.
  • 2 — Adequate. Well-structured, precise prose; key terms defined; tone is calm and evidence-led; British English (or established AI Defense Plane spelling) where the resource is internal.
  • 3 — Strong. Exceptionally clear; lands its argument quickly; cross-references its own taxonomy; could be used as an editorial model.

Recency and ongoing relevance

  • 0 — Absent. Material describes systems or threats that have been superseded; recommendations are obsolete.
  • 1 — Inadequate. More than two years old without re-publication; the primary surfaces or threats it addresses have shifted.
  • 2 — Adequate. Published within the last two years, or older but still describing durable architecture or principles; a last-checked date is recorded for external resources.
  • 3 — Strong. Current within twelve months, or older but maintained and re-published; explicitly tracks changes in the agentic threat landscape.

Transparency of limitations

  • 0 — Absent. Presents conclusions as universal; no caveats, scope statement, or threat model.
  • 1 — Inadequate. Generic disclaimer only; no specific limitation or scope boundary stated.
  • 2 — Adequate. Specific limitations and scope boundaries stated (model classes, tool surfaces, deployment contexts not covered).
  • 3 — Strong. Limitations are quantified or worked through with examples; the resource explicitly states what it does not claim and where its conclusions stop applying.

Scoring procedure

  1. Raters. Single rater for awesome-list inclusion; two raters with adjudication for resources that the field guide will rely on heavily (e.g., a pattern document, a foundational standard cited from multiple chains).
  2. Evidence. Cite a section, page, or paragraph for each score. For external resources, record the URL and the Last-checked date required by CONTRIBUTING.md.
  3. Editorial check. Verify CONTRIBUTING.md editorial standards: British English (where internal), calm tone, no copied source material, no unnecessary operational exploit detail, no overstated maturity.
  4. Recording. File the completed scoresheet under scoresheets/ using the naming convention resource-{slug}.md.

Aggregation rule

  • Maximum raw score: 15 (5 criteria × 3).
  • Floor rule: any criterion at 0 fails the rubric outright; two or more criteria at ≤ 1 also fail. Evidence-based claims at ≤ 1 is a hard fail.
  • Verdict bands:
    • ≥ 12 with no criterion below 2: Include in awesome list / cite from field guide.
    • 9–11 with no criterion below 2: Include with reviewer-noted caveats alongside the entry.
    • < 9, or floor rule triggered: Do not include; rework if internal, drop if external.
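
Taken together, the floor rule and verdict bands reduce to a short decision procedure. The Python sketch below is illustrative only: the criterion key, the function name, and the conservative fallback for totals that satisfy neither band are assumptions, not part of the rubric.

HARD_FAIL_CRITERION = "Evidence-based claims"

def verdict(scores: dict[str, int]) -> str:
    """Apply the floor rule and verdict bands to a completed scoresheet."""
    # Floor rule: any criterion at 0, two or more criteria at <= 1,
    # or evidence-based claims at <= 1, fails outright.
    if any(s == 0 for s in scores.values()):
        return "Do not include"
    if sum(1 for s in scores.values() if s <= 1) >= 2:
        return "Do not include"
    if scores.get(HARD_FAIL_CRITERION, 0) <= 1:
        return "Do not include"

    total = sum(scores.values())      # maximum raw score: 15 (5 criteria x 3)
    lowest = min(scores.values())
    if total >= 12 and lowest >= 2:
        return "Include"
    if total >= 9 and lowest >= 2:
        return "Include with caveats"
    return "Do not include"           # below 9, or a criterion below 2 outside both bands

Read strictly, a high total with any criterion below 2 still falls outside both inclusion bands; the sketch treats that case as Do not include.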

Scoresheet template

---
rubric: resource-quality-rubric.md
artefact: <resource title>
artefact_url_or_path: <URL for external; repo path for internal>
artefact_version: <publication date or commit SHA>
last_checked: <YYYY-MM-DD> # for external resources only
scored_by: <name or handle>
scored_on: <YYYY-MM-DD>
rater_count: <1 or 2-with-adjudication>
---
## Scores
| Criterion | Score (0–3) | Evidence | Notes |
|---|---|---|---|
| Relevance to agentic AI security | | | |
| Evidence-based claims | | | |
| Clarity and editorial quality | | | |
| Recency and ongoing relevance | | | |
| Transparency of limitations | | | |
## Aggregate
- Raw total: __ / 15
- Floor rule triggered: <yes/no, which criterion>
- Verdict: <Include / Include with caveats / Do not include>
## Reviewer commentary
<2–4 sentences on what's strongest, what's weakest, and what would lift the score.>

Inter-rater agreement

Pilot pending. Target: Cohen’s κ ≥ 0.6 on a 12-resource pilot before this rubric is treated as load-bearing for awesome-list inclusion. Until then, scores are guidance only, and disagreements should be resolved through discussion.
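
For the pilot itself, agreement can be checked with a few lines of Python. The sketch below is a minimal, assumed harness: both raters’ per-criterion scores are flattened into parallel lists of 0–3 values, and unweighted Cohen’s κ is computed (a weighted variant may suit ordinal scores better). Function and variable names are illustrative.

from collections import Counter

def cohens_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    assert rater_a and len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters gave the same score.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement: chance agreement from each rater's score distribution.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(
        (counts_a[c] / n) * (counts_b[c] / n)
        for c in set(counts_a) | set(counts_b)
    )
    if expected == 1.0:
        return 1.0  # both raters used a single category throughout
    return (observed - expected) / (1 - expected)

# Target before the rubric is treated as load-bearing: cohens_kappa(a, b) >= 0.6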