Case study rubric

A rubric for evaluating incident and scenario case studies in agentic AI security.

Criteria

  • Clarity of what happened and why it matters
  • Attack surface and exploit path analysis
  • Impact and controls discussion
  • Evidence and references
  • Maturity and generalisability

Anchored level descriptors

Each criterion is scored 0–3 against the anchors below.

Clarity of what happened and why it matters

  • 0 — Absent. The narrative is missing or incoherent; a reader cannot tell what occurred.
  • 1 — Inadequate. A surface narrative exists but key actors, sequence, or outcome are unclear.
  • 2 — Adequate. A reader can summarise the incident in two sentences after a single read; the “why it matters” is stated explicitly.
  • 3 — Strong. The narrative is precise, sequenced, and anchored in concrete artefacts; the “why it matters” is tied to a named threat model entry or attack chain.

Attack surface and exploit path analysis

  • 0 — Absent. No surfaces or paths identified.
  • 1 — Inadequate. Surfaces are named but not connected into a path, or a path is described without referencing the surfaces involved.
  • 2 — Adequate. Attack surfaces are named using repo-native vocabulary (from docs/02-attack-surfaces.md) and connected into a path that maps to one of the chains in docs/agentic-attack-chains/.
  • 3 — Strong. Each step in the exploit path is anchored to a specific surface, the chain is named explicitly, and preconditions and trust boundaries are called out at each transition.

Impact and controls discussion

  • 0 — Absent. Impact and controls are not discussed.
  • 1 — Inadequate. Impact stated vaguely; controls listed generically without tying to the failure path.
  • 2 — Adequate. Impact is concrete (data, authority, downstream system, regulatory exposure). Controls are named from patterns/ and tied to the specific points in the path where they would have interrupted the chain.
  • 3 — Strong. Impact is quantified or scoped; controls discussion identifies which control at which point would have prevented or contained the chain, and acknowledges any control whose absence was the root cause.

Evidence and references

  • 0 — Absent. No references; no indication whether the case is real, hypothetical, or composite.
  • 1 — Inadequate. Sparse references; evidence level not stated; reader cannot tell what is grounded vs. inferred.
  • 2 — Adequate. Maturity and evidence levels are explicitly stated (real / plausible / hypothetical; well-evidenced / supported / inferred). External references support the technical claims.
  • 3 — Strong. Every load-bearing claim is referenced; the evidence level per claim is explicit; provenance is clear; the case can be independently verified.

Maturity and generalisability

  • 0 — Absent. Cannot be generalised; reads as anecdote.
  • 1 — Inadequate. One-off; no defensive lesson abstracted.
  • 2 — Adequate. Generalises into a defensive lesson tied to at least one chain and one pattern; useful for teams beyond the original context.
  • 3 — Strong. Sets a teaching example: the lesson is portable across stacks, the failure mode is named, and the case is suitable for inclusion in red-team scenario libraries or onboarding material.

Scoring procedure

  1. Raters. Single rater for routine review; two raters with adjudication when the case is being added to the canonical case-study set in docs/09-incident-case-studies.md.
  2. Evidence. Cite the case-study fields directly (e.g., the “What happened” section reads “…”). For external references, paste the URL or DOI in the Evidence column.
  3. Editorial check. Verify the case follows the CONTRIBUTING.md editorial standard — calm tone, evidence-led, no unnecessary operational exploit detail.
  4. Recording. File the completed scoresheet under scoresheets/ using the naming convention case-study-{case-slug}.md (a path helper is sketched below).
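
Step 4’s naming convention is easy to automate. A minimal sketch in Python: the `slugify` helper is hypothetical (any lowercase-and-hyphenate implementation will do), and only the case-study-{case-slug}.md pattern and the scoresheets/ directory come from this rubric.

```python
import re
from pathlib import Path

def slugify(title: str) -> str:
    """Hypothetical helper: lowercase, collapse non-alphanumerics to hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

def scoresheet_path(case_title: str) -> Path:
    """Apply the naming convention case-study-{case-slug}.md under scoresheets/."""
    return Path("scoresheets") / f"case-study-{slugify(case_title)}.md"

# scoresheet_path("Tool misuse leading to credential leak")
# -> scoresheets/case-study-tool-misuse-leading-to-credential-leak.md
```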

Aggregation rule

  • Maximum raw score: 15 (5 criteria × 3).
  • Floor rule: any criterion at 0 fails the rubric outright; two or more criteria at ≤ 1 also fail. Evidence and references at ≤ 1 is a hard fail because case studies without evidence cannot be cited safely.
  • Verdict bands (sketched in code after this list):
    • ≥ 12 with no criterion below 2 and Evidence ≥ 2 → Publish.
    • 9–11 with no criterion below 2 → Publish with reviewer-noted caveats; flag the weakest dimension for a future revision.
    • < 9, or floor rule triggered → Do not publish; rework or drop.
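
Because the floor rule and verdict bands are mechanical, they can be expressed directly in code. A minimal Python sketch using the thresholds above; the function name and the dict-of-scores input shape are illustrative, not part of the rubric.

```python
# Criterion names as they appear in this rubric.
CRITERIA = [
    "Clarity of what happened and why it matters",
    "Attack surface and exploit path analysis",
    "Impact and controls discussion",
    "Evidence and references",
    "Maturity and generalisability",
]

def verdict(scores: dict[str, int]) -> str:
    """Apply the floor rule and verdict bands to per-criterion scores (0-3)."""
    values = [scores[c] for c in CRITERIA]
    total = sum(values)

    # Floor rule: any 0 fails outright; two or more criteria at <= 1 fail;
    # Evidence and references at <= 1 is a hard fail.
    if (0 in values
            or sum(v <= 1 for v in values) >= 2
            or scores["Evidence and references"] <= 1):
        return "Do not publish"

    # Evidence >= 2 is already guaranteed by the hard fail above.
    if total >= 12 and min(values) >= 2:
        return "Publish"
    if 9 <= total <= 11 and min(values) >= 2:
        return "Publish with caveats"
    # < 9, or no band matched: rework or drop.
    return "Do not publish"
```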

Scoresheet template

---
rubric: case-study-rubric.md
artefact: <case study title>
artefact_path: <path to case study, e.g. docs/09-incident-case-studies.md#tool-misuse-leading-to-credential-leak>
scored_by: <name or handle>
scored_on: <YYYY-MM-DD>
rater_count: <1 or 2-with-adjudication>
---
## Scores
| Criterion | Score (0–3) | Evidence | Notes |
|---|---|---|---|
| Clarity of what happened and why it matters | | | |
| Attack surface and exploit path analysis | | | |
| Impact and controls discussion | | | |
| Evidence and references | | | |
| Maturity and generalisability | | | |
## Aggregate
- Raw total: __ / 15
- Floor rule triggered: <yes/no, which criterion>
- Verdict: <Publish / Publish with caveats / Do not publish>
## Reviewer commentary
<2–4 sentences on what's strongest, what's weakest, and what would lift the score.>
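
The frontmatter in this template is machine-checkable before a scoresheet is filed. A minimal sketch, assuming Python with PyYAML installed; the required field names come from the template above, the function itself is illustrative.

```python
import yaml

# Required frontmatter fields, as listed in the template above.
REQUIRED_FIELDS = {
    "rubric", "artefact", "artefact_path",
    "scored_by", "scored_on", "rater_count",
}

def missing_fields(scoresheet_text: str) -> set[str]:
    """Return any required frontmatter fields absent from a scoresheet."""
    # The frontmatter sits between the first two '---' delimiters.
    _, frontmatter, _ = scoresheet_text.split("---", 2)
    fields = yaml.safe_load(frontmatter) or {}
    return REQUIRED_FIELDS - fields.keys()
```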

Inter-rater agreement

Pilot pending. The target is Cohen’s κ ≥ 0.6 on a 12-case pilot before this rubric is treated as load-bearing for canonical case-study inclusion. Until then, scores are guidance and disagreements should be resolved through discussion.
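
When the pilot runs, κ can be computed directly from the two raters’ parallel score lists. A minimal sketch, assuming scikit-learn is available; the scores shown are placeholders, not pilot data.

```python
from sklearn.metrics import cohen_kappa_score

# One entry per (case, criterion) pair, scored 0-3 by each rater; placeholder values.
rater_a = [3, 2, 2, 3, 1, 2, 2, 3, 2, 2]
rater_b = [3, 2, 1, 3, 1, 2, 2, 2, 2, 2]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f} (target >= 0.6)")

# For an ordinal 0-3 scale, weights="quadratic" penalises larger
# disagreements more heavily and may be the fairer variant.
```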