# Resource quality rubric
A rubric for assessing the quality of resources included in the agentic AI security field guide.
## Criteria
- Relevance to agentic AI security
- Evidence-based claims
- Clarity and editorial quality
- Recency and ongoing relevance
- Transparency of limitations
## Anchored level descriptors
Each criterion is scored 0–3 against the anchors below. The rubric applies to both external resources (standards, papers, tools) and internal field-guide resources (docs pages, chain stubs, patterns).
### Relevance to agentic AI security
- 0 — Absent. Generic AI or general security material with no agentic relevance.
- 1 — Inadequate. Touches one aspect tangentially (e.g., LLM safety in general) without addressing tools, memory, credentials, multi-agent flow, or delegated authority.
- 2 — Adequate. Directly addresses one or more agentic surfaces or chains; useful to a reader of this field guide.
- 3 — Strong. Centrally about agentic execution security; addresses multiple surfaces or chains and earns a place in a reading path for the topic.
### Evidence-based claims
- 0 — Absent. Claims are unsupported, speculative, or promotional.
- 1 — Inadequate. Some claims are supported, others are assertions without evidence; reader cannot tell which is which.
- 2 — Adequate. Load-bearing claims are sourced (citation, dataset, code, or named example). The distinction between empirical claims and informed opinion is clear.
- 3 — Strong. Every load-bearing claim is referenced or demonstrated; methodology behind quantitative claims is explicit; opinions are flagged as such.
### Clarity and editorial quality
- 0 — Absent. Unstructured, jargon-heavy, or unreadable; key terms undefined.
- 1 — Inadequate. Readable but disorganised; key concepts buried; tone inconsistent with the field guide’s editorial standard.
- 2 — Adequate. Well-structured, precise prose; key terms defined; tone is calm and evidence-led; British English (or established AI Defense Plane spelling) where the resource is internal.
- 3 — Strong. Exceptionally clear; lands its argument quickly; cross-references its own taxonomy; could be used as an editorial model.
### Recency and ongoing relevance
- 0 — Absent. Material describes systems or threats that have been superseded; recommendations are obsolete.
- 1 — Inadequate. More than two years old without re-publication; primary surfaces or threats it addresses have shifted.
- 2 — Adequate. Within two years, or older but still describes durable architecture or principles; last-checked date recorded for external resources.
- 3 — Strong. Current within twelve months, or older but maintained and re-published; explicitly tracks changes in the agentic threat landscape.
### Transparency of limitations
- 0 — Absent. Presents conclusions as universal; no caveats, scope statement, or threat model.
- 1 — Inadequate. Generic disclaimer only; no specific limitation or scope boundary stated.
- 2 — Adequate. Specific limitations and scope boundaries stated (model classes, tool surfaces, deployment contexts not covered).
- 3 — Strong. Limitations are quantified or worked through with examples; the resource explicitly states what it does not claim and where its conclusions stop applying.
## Scoring procedure
- Raters. Single rater for awesome-list inclusion; two raters with adjudication for resources that the field guide will rely on heavily (e.g., a pattern document, a foundational standard cited from multiple chains).
- Evidence. Cite a section, page, or paragraph for each score. For external resources, record the URL and the last-checked date required by CONTRIBUTING.md.
- Editorial check. Verify CONTRIBUTING.md editorial standards: British English (where internal), calm tone, no copied source material, no unnecessary operational exploit detail, no overstated maturity.
- Recording. File the completed scoresheet under scoresheets/ using the naming convention resource-{slug}.md (path derivation sketched below).
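Where it helps, the naming convention can be applied mechanically. A minimal sketch in Python; the slug derivation (lowercase, hyphen-collapsed) is an assumption rather than a documented rule, so defer to CONTRIBUTING.md if it specifies its own:

```python
import re
from pathlib import Path

def scoresheet_path(title: str, root: str = "scoresheets") -> Path:
    """Build the scoresheet path for a resource title.

    Assumed slug rule: lowercase the title and collapse runs of
    non-alphanumeric characters to single hyphens.
    """
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return Path(root) / f"resource-{slug}.md"

# e.g. scoresheet_path("OWASP Agentic AI Threats")
#      -> scoresheets/resource-owasp-agentic-ai-threats.md
```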
## Aggregation rule
- Maximum raw score: 15 (5 criteria × 3).
- Floor rule: any criterion scored 0 fails the rubric outright; two or more criteria at ≤ 1 also fail. A score of ≤ 1 on Evidence-based claims is a hard fail regardless of the total.
- Verdict bands (applied mechanically in the sketch after this list):
  - ≥ 12 with no criterion below 2 → Include in awesome list / cite from field guide.
  - 9–11 with no criterion below 2 → Include with reviewer-noted caveats alongside the entry.
  - < 9, or floor rule triggered → Do not include; rework if internal, drop if external.
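The rule above is mechanical enough to express directly. A minimal sketch in Python; the function and key names are illustrative, and the treatment of a single non-evidence criterion at 1 with a passing total follows the strictest reading of the bands:

```python
CRITERIA = [
    "Relevance to agentic AI security",
    "Evidence-based claims",
    "Clarity and editorial quality",
    "Recency and ongoing relevance",
    "Transparency of limitations",
]

def verdict(scores: dict[str, int]) -> str:
    """Apply the floor rule and verdict bands to one resource's scores."""
    values = [scores[c] for c in CRITERIA]
    total = sum(values)  # raw total out of 15

    # Floor rule: any 0 fails outright; two or more criteria at <= 1
    # fail; Evidence-based claims at <= 1 is a hard fail.
    if (0 in values
            or sum(v <= 1 for v in values) >= 2
            or scores["Evidence-based claims"] <= 1):
        return "Do not include"

    if total >= 12 and min(values) >= 2:
        return "Include"
    if total >= 9 and min(values) >= 2:
        return "Include with caveats"
    # Covers totals below 9, plus the edge case of a single
    # non-evidence criterion at 1 that passes the floor rule
    # but fits neither band.
    return "Do not include"

# e.g. scores of 3, 2, 2, 2, 3 give a raw total of 12 with no
# criterion below 2, so verdict(...) returns "Include".
```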
## Scoresheet template
```markdown
---
rubric: resource-quality-rubric.md
artefact: <resource title>
artefact_url_or_path: <URL for external; repo path for internal>
artefact_version: <publication date or commit SHA>
last_checked: <YYYY-MM-DD> # for external resources only
scored_by: <name or handle>
scored_on: <YYYY-MM-DD>
rater_count: <1 or 2-with-adjudication>
---

## Scores

| Criterion | Score (0–3) | Evidence | Notes |
|---|---|---|---|
| Relevance to agentic AI security | | | |
| Evidence-based claims | | | |
| Clarity and editorial quality | | | |
| Recency and ongoing relevance | | | |
| Transparency of limitations | | | |

## Aggregate

- Raw total: __ / 15
- Floor rule triggered: <yes/no, which criterion>
- Verdict: <Include / Include with caveats / Do not include>

## Reviewer commentary

<2–4 sentences on what's strongest, what's weakest, and what would lift the score.>
```
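A completed scoresheet can be checked mechanically before filing. A minimal sketch, assuming the front matter parses as plain YAML (requires PyYAML; field names follow the template above, and the check itself is illustrative):

```python
import yaml  # PyYAML

REQUIRED = ["rubric", "artefact", "artefact_url_or_path",
            "artefact_version", "scored_by", "scored_on", "rater_count"]

def check_front_matter(path: str) -> list[str]:
    """Return a list of problems found in a scoresheet's front matter."""
    text = open(path, encoding="utf-8").read()
    if not text.startswith("---"):
        return ["missing front matter"]
    # The front matter is the block between the first two '---' markers.
    block = text.split("---", 2)[1]
    meta = yaml.safe_load(block) or {}
    problems = [f"missing field: {f}" for f in REQUIRED if f not in meta]
    # last_checked is required for external resources only; treating a
    # URL-shaped artefact path as "external" is an assumption.
    if (str(meta.get("artefact_url_or_path", "")).startswith("http")
            and "last_checked" not in meta):
        problems.append("external resource without last_checked")
    return problems
```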
## Inter-rater agreement
Pilot pending. Target Cohen’s κ ≥ 0.6 on a 12-resource pilot before this rubric is treated as load-bearing for awesome-list inclusion. Until then, scores are guidance and disagreements should be resolved through discussion.
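For the pilot, κ can be computed per criterion across the 12 resources. A minimal sketch of unweighted Cohen’s κ; a weighted variant arguably suits ordinal 0–3 scores better, but the rubric does not specify one:

```python
from collections import Counter

def cohens_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement, from each rater's marginal score distribution.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / n**2
    # Undefined when expected == 1 (both raters used a single score).
    return (observed - expected) / (1 - expected)

# e.g. two raters' Evidence-based-claims scores over a 12-resource pilot:
# cohens_kappa([2, 3, 2, 1, 2, 3, 2, 2, 3, 2, 1, 2],
#              [2, 3, 2, 2, 2, 3, 2, 2, 3, 2, 1, 2])
```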