Benchmarks
This document catalogues and describes benchmarks for agentic AI security, including their scope, methodology, and limitations.
Why Benchmarks Matter And How To Read Them
Benchmarks can provide useful evidence, but they are not proof that a production agent is secure. They should be paired with threat modelling, architecture review, runtime telemetry, and system-specific red teaming.
Benchmark results are most useful when readers understand the boundary being tested: a model response, a threat snapshot, a simulated tool workflow, a red team scenario, or a production-like control path. Results should not be generalised beyond the tested environment without additional evidence.
When you read a benchmark report, ask:
- What is being scored — the model in isolation, an agent loop, or an end-to-end deployed system?
- What threat model and trust boundary does the benchmark assume?
- Which surfaces does it exercise (instructions, retrieved context, tools, credentials, memory, approvals, downstream actions)?
- What does a passing or failing score actually demonstrate about real risk?
What Makes A Good Benchmark For Agentic Systems
A useful agentic security benchmark generally:
- Tests an agent in motion (tools, memory, state, multi-step reasoning), not only the underlying model.
- States its threat model and trust boundary explicitly so readers can map the result to their own deployment.
- Exercises adversarial inputs across more than one surface — for example, indirect prompt injection that lands in a tool call, or a poisoned retrieval that drives a memory write.
- Reports both task success and security objectives, so improvements on one are not masked by regressions on the other.
- Documents method, scoring, and limitations clearly enough for another team to reproduce or contest the result.
- Provides reusable scenarios, payloads, or harnesses that teams can extend with their own tools, data, and policies.
For the full criteria and scoring guide, see the benchmark quality rubric.
Catalogue Of Current Benchmarks
The table below summarises the public benchmarks and evaluation harnesses tracked by this repository. For full metadata (producer, source, coverage, last checked, limitations) see the resource catalogue.
| Benchmark | Producer | What it tests | Maturity |
|---|---|---|---|
| AgentDojo | ETH Zurich SPY Lab | Tool-using agents under indirect prompt injection; defences vs. task completion | Emerging |
| Backbone Breaker Benchmark | Lakera (with UK AISI) | Backbone LLM behaviour at vulnerable agent moments via threat snapshots | Emerging, high-signal |
| Gandalf Agent Breaker | Lakera | Public testbed: RAG, browsing, tools, memory, prompt extraction, exfiltration | Medium |
| NVIDIA NeMo Agent Toolkit Red Teaming | NVIDIA | End-to-end agent workflow evaluation with adversarial scenarios and risk scores | Practical example |
| CyberSecEval | Meta Purple Llama | Cybersecurity knowledge, secure coding, abuse, prompt-injection-related tasks | Mature (model-level) |
| OWASP GenAI Red Teaming Guide | OWASP GenAI Security Project | Methodology for model, implementation, infrastructure, and runtime testing | Mature (guide) |
| Q4 2025 AI Agent Security Trends Report | Lakera | Vendor-observed production attack traffic against early agentic systems | Medium (vendor report) |
For richer notes on each entry — coverage, evidence quality, and caveats — see the full benchmark catalogue.
Methodology And Evaluation Criteria
Treat benchmark scores as one signal in a wider evaluation programme. The following references describe how to design, score, and combine evaluations:
- Red teaming and evaluation — methodology for scenario design, evidence requirements, and reporting.
- Benchmark quality rubric — criteria for judging whether a benchmark is informative for a given system.
- Agent security readiness rubric — control coverage expectations a benchmark alone cannot demonstrate.
Known Limitations And Gaps
Even strong benchmarks have meaningful gaps when applied to real agentic deployments:
- Many benchmarks score the backbone LLM rather than the full deployed agent, so they miss tool brokering, credential boundaries, memory controls, approvals, and downstream actions.
- Coverage of memory poisoning, multi-agent propagation, and long-horizon autonomy is uneven across the public landscape.
- Scenarios are often simplified, vendor-operated, or narrowly scoped; results rarely transfer between deployments without re-testing on the target system.
- Defence comparisons can be misleading when the benchmark does not separate task success from security objectives.
- Public benchmarks cannot cover production-specific tools, data, policies, and approval flows; system-specific tests are still required.
Where To Go Next
- Red teaming and evaluation for evaluation methodology.
- Rubrics for scoring guides used across this repository.
- Resources: benchmarks catalogue for the full per-entry metadata.