Benchmarks

This document catalogues and describes benchmarks for agentic AI security, including their scope, methodology, and limitations.

Why Benchmarks Matter And How To Read Them

Benchmarks can provide useful evidence, but they are not proof that a production agent is secure. They should be paired with threat modelling, architecture review, runtime telemetry, and system-specific red teaming.

Benchmark results are most useful when readers understand the boundary being tested: a model response, a threat snapshot, a simulated tool workflow, a red team scenario, or a production-like control path. Results should not be generalised beyond the tested environment without additional evidence.

When you read a benchmark report, ask:

  • What is being scored — the model in isolation, an agent loop, or an end-to-end deployed system?
  • What threat model and trust boundary does the benchmark assume?
  • Which surfaces does it exercise (instructions, retrieved context, tools, credentials, memory, approvals, downstream actions)?
  • What does a passing or failing score actually demonstrate about real risk?
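One lightweight way to apply these questions is to record the answers as structured metadata before relying on a score. The sketch below is illustrative only: the schema, field names, and example values are invented for this document, not drawn from any particular benchmark.

```python
from dataclasses import dataclass, field

@dataclass
class BenchmarkReading:
    """Structured answers to the reading questions above (illustrative schema)."""
    name: str
    scored_boundary: str   # "model", "agent loop", or "deployed system"
    threat_model: str      # assumed attacker capabilities and trust boundary
    surfaces: list[str] = field(default_factory=list)  # e.g. ["tools", "memory"]
    evidence_claim: str = ""  # what a pass/fail actually demonstrates

    def transfers_to(self, deployment_surfaces: set[str]) -> bool:
        """A score is only suggestive for surfaces the benchmark exercised."""
        return deployment_surfaces <= set(self.surfaces)

# Hypothetical example entry for a fictional benchmark.
reading = BenchmarkReading(
    name="ExampleBench",
    scored_boundary="agent loop",
    threat_model="indirect prompt injection via retrieved documents",
    surfaces=["instructions", "retrieved context", "tools"],
    evidence_claim="agent resists injected tool-call instructions in a sandbox",
)
print(reading.transfers_to({"tools", "memory"}))  # memory untested -> False
```

The `transfers_to` check makes the last question concrete: a benchmark that never exercised memory says nothing about a deployment where memory writes are in scope.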

What Makes A Good Benchmark For Agentic Systems

A useful agentic security benchmark generally:

  • Tests an agent in motion (tools, memory, state, multi-step reasoning), not only the underlying model.
  • States its threat model and trust boundary explicitly so readers can map the result to their own deployment.
  • Exercises adversarial inputs across more than one surface — for example, indirect prompt injection that lands in a tool call, or a poisoned retrieval that drives a memory write.
  • Reports both task success and security objectives, so improvements on one are not masked by regressions on the other.
  • Documents method, scoring, and limitations clearly enough for another team to reproduce or contest the result.
  • Provides reusable scenarios, payloads, or harnesses that teams can extend with their own tools, data, and policies.
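The dual-reporting criterion above can be sketched in a few lines: score task success and security outcomes separately, so a defence that "wins" by refusing every task is visible. The scenario data is invented for illustration.

```python
# Each tuple records one benchmark scenario: did the benign task complete,
# and was the adversarial objective blocked? (Invented example data.)
scenarios = [
    (True, True),
    (True, False),   # task done, but the injected action also executed
    (False, True),   # attack blocked, but the benign task failed
    (True, True),
]

# Report the two rates side by side instead of a single blended score.
task_rate = sum(t for t, _ in scenarios) / len(scenarios)
security_rate = sum(s for _, s in scenarios) / len(scenarios)

print(f"task success: {task_rate:.0%}")   # 75%
print(f"security:     {security_rate:.0%}")  # 75%
```

A single aggregate number would hide the trade-off; reporting both rates shows whether a defence improved security at the cost of utility, or vice versa.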

For the full criteria and scoring guide, see the benchmark quality rubric.

Catalogue Of Current Benchmarks

The table below summarises the public benchmarks and evaluation harnesses tracked by this repository. For full metadata (producer, source, coverage, last checked, limitations) see the resource catalogue.

Benchmark | Producer | What it tests | Maturity
AgentDojo | ETH Zurich SPY Lab | Tool-using agents under indirect prompt injection; defences vs. task completion | Emerging
Backbone Breaker Benchmark | Lakera (with UK AISI) | Backbone LLM behaviour at vulnerable agent moments via threat snapshots | Emerging, high-signal
Gandalf Agent Breaker | Lakera | Public testbed: RAG, browsing, tools, memory, prompt extraction, exfiltration | Medium
NVIDIA NeMo Agent Toolkit Red Teaming | NVIDIA | End-to-end agent workflow evaluation with adversarial scenarios and risk scores | Practical example
CyberSecEval | Meta Purple Llama | Cybersecurity knowledge, secure coding, abuse, prompt-injection-related tasks | Mature (model-level)
OWASP GenAI Red Teaming Guide | OWASP GenAI Security Project | Methodology for model, implementation, infrastructure, and runtime testing | Mature (guide)
Q4 2025 AI Agent Security Trends Report | Lakera | Vendor-observed production attack traffic against early agentic systems | Medium (vendor report)

For richer notes on each entry — coverage, evidence quality, and caveats — see the full benchmark catalogue.

Methodology And Evaluation Criteria

Treat benchmark scores as one signal in a wider evaluation programme. For guidance on how to design, score, and combine evaluations, see the benchmark quality rubric and the resource catalogue.

Known Limitations And Gaps

Even strong benchmarks have meaningful gaps when applied to real agentic deployments:

  • Many benchmarks score the backbone LLM rather than the full deployed agent, so they miss tool brokering, credential boundaries, memory controls, approvals, and downstream actions.
  • Coverage of memory poisoning, multi-agent propagation, and long-horizon autonomy is uneven across the public landscape.
  • Scenarios are often simplified, vendor-operated, or narrowly scoped; results rarely transfer between deployments without re-testing on the target system.
  • Defence comparisons can be misleading when the benchmark does not separate task success from security objectives.
  • Public benchmarks cannot cover production-specific tools, data, policies, and approval flows; system-specific tests are still required.

Where To Go Next