
# 🧠 Awesome Agentic Engineering

> **Stop prompting. Start engineering. A structured reference for taking AI agents into production.**

[![Awesome](https://awesome.re/badge-flat2.svg)](https://awesome.re) [![Last Commit](https://img.shields.io/github/last-commit/natnew/Awesome-Agentic-Engineering?label=last%20updated&style=flat-square)](https://github.com/natnew/Awesome-Agentic-Engineering/commits) [![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg?style=flat-square)](http://makeapullrequest.com) [![License: MIT](https://img.shields.io/badge/License-MIT-lightgrey.svg?style=flat-square)](LICENSE)

A curated map of agentic AI systems, covering architectures, frameworks, memory, evaluation, and safety.

This is not a tool list. It's a structured guide to building **reliable, observable, production-grade agentic systems**, evaluated against rigorous engineering dimensions.

📑 Table of Contents

Audience: all contributors · Evidence class: mixed

🧭 Thesis
🧱 Agentic Engineering Reference Stack
⚖️ Architecture Decision Guide
🧩 Core Agentic Patterns
🏗️ Reference Architectures
📐 Spec-Driven Development
🧠 Memory Systems
📊 Formal Evaluation Rubric
Benchmark and Evidence Policy
⚙️ Orchestration Frameworks
📡 Protocols and Standards
🛂 Agent Authority, Identity & Delegation
🧭 Reasoning & Planning Models
🧪 Evaluation & Safety
🧠 Skills and Operating Principles
🚫 What NOT to Do
📊 Signals (How to Read This List)
🚀 Getting Started
🤝 Contributing
📌 Final Note

📂 Appendix

🧭 Thesis

Audience: all contributors · Evidence class: mixed

📈 The Shift (Agentic systems are moving to)	📉 The Challenge (Implementations suffer from)	🎯 Our Focus (This repository prioritises)
• Stateful, multi-step reasoning • Multi-agent collaboration & orchestration • Feedback-driven learning loops • Tool-augmented execution environments	• Fragility under iteration • Poor observability & evaluation • Weak memory & context management • Limited safety & governance	• Reliability over novelty • Evaluation over intuition • Architecture over tooling • Systems thinking over prompt engineering

🧱 Agentic Engineering Reference Stack

Audience: practitioners · Evidence class: mixed

A serious agentic system has nine engineering layers between a user goal and a running production system. Most production failures are a missing layer, not a bad model.

flowchart TD
    A["User / System Goal"] --> B["Spec / Intent / Task Definition"]
    B --> C["Planner / Decomposition"]
    C --> D["Agent Runtime / Control Loop"]
    D --> E["Tools / APIs / External Systems"]
    E --> F["Memory / Context / State"]
    F --> G["Evaluation / Tests / Rubrics / Guards"]
    G --> H["Observability / Tracing / Monitoring"]
    H --> I["Governance / Identity / Permissions / Human Approval"]
    I --> J["Deployment / Runtime Operations"]

Tier	Layers	What the tier is for
Intent and execution	Spec · Planner · Agent Runtime	Define what the system should do and how it does it.
World-binding	Tools · Memory · Evaluation	Connect the system to data, history, and ground truth.
Operability	Observability · Governance · Deployment	Keep it inspectable, accountable, and operable.

Skip a tier and the layers above it become indefensible.
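
The "missing layer" idea can be turned into a simple design-review checklist. The sketch below is illustrative only: `STACK` and `missing_layers` are hypothetical names, not part of this repository, and the layer names are copied from the tier table above.

```python
# Hypothetical checklist: the nine layers of the reference stack, grouped by tier.
STACK = {
    "Intent and execution": ["Spec", "Planner", "Agent Runtime"],
    "World-binding": ["Tools", "Memory", "Evaluation"],
    "Operability": ["Observability", "Governance", "Deployment"],
}


def missing_layers(implemented: set[str]) -> dict[str, list[str]]:
    """Return, per tier, the layers a system design has not covered yet."""
    gaps = {}
    for tier, layers in STACK.items():
        absent = [layer for layer in layers if layer not in implemented]
        if absent:
            gaps[tier] = absent
    return gaps


# Example: a prototype that only has a runtime, tools, and a deployment target.
prototype = {"Agent Runtime", "Tools", "Deployment"}
for tier, absent in missing_layers(prototype).items():
    print(f"{tier}: missing {', '.join(absent)}")
```

A check like this does not prove a layer is implemented well; it only makes an absent layer visible before production does.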

→ Full reference, with per-layer engineering concerns and failure modes: Agentic Engineering Reference Stack.

⚖️ Architecture Decision Guide

Audience: practitioners · Evidence class: mixed

If your task is…	Start with…	Escalate to…	Avoid…
bounded, tool-using, low-risk	single-agent + tools	typed state, retries	multi-agent teams
long-running, inspectable, enterprise	graph/workflow orchestration	approval gates, persistence	opaque emergent loops
open-ended research	planner/executor or supervisor	critique loops, memory	rigid pipelines only
high-reliability extraction	prompt chains + strict schemas	validator feedback loops	unconstrained conversational agents
complex parallel execution	modular multi-agent setups	shared workspace/memory	treating LLMs as deterministic

Authority rule: any agent that can act outside its own runtime — for example by calling APIs, modifying files, opening pull requests, sending messages, triggering workflows, or accessing private data — needs an explicit identity, scoped authority, audit trail, and revocation path. See Agent Authority, Identity & Delegation.
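
The four requirements in the authority rule (identity, scoped authority, audit trail, revocation) can be sketched as a single record. This is a minimal illustration under stated assumptions: `AgentGrant`, its fields, and the scope strings such as `pr:open` are hypothetical and do not come from any specific IAM product.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class AgentGrant:
    """Hypothetical record tying an agent identity to scoped, revocable authority."""
    agent_id: str            # explicit identity, distinct from the delegator
    delegator: str           # the human or service identity that granted authority
    scopes: frozenset        # scoped authority, e.g. {"repo:read", "pr:open"}
    revoked: bool = False    # revocation path
    audit_log: list = field(default_factory=list)  # audit trail

    def authorize(self, action: str) -> bool:
        """Allow an action only if it is in scope and the grant is live; log every decision."""
        allowed = (not self.revoked) and action in self.scopes
        self.audit_log.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "agent": self.agent_id,
            "delegator": self.delegator,
            "action": action,
            "allowed": allowed,
        })
        return allowed

    def revoke(self) -> None:
        """Cut off all further actions under this grant."""
        self.revoked = True


grant = AgentGrant("pr-bot-1", "alice@example.com", frozenset({"repo:read", "pr:open"}))
grant.authorize("pr:open")      # in scope
grant.authorize("repo:delete")  # out of scope: denied, but still logged
grant.revoke()
grant.authorize("pr:open")      # denied after revocation
```

Denied attempts are logged as well as allowed ones, since out-of-scope requests are often the most useful signal during incident review.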

🧩 Core Agentic Patterns

Audience: practitioners · Evidence class: mixed

These patterns underpin most production-grade agentic systems.

Pattern	Description	Key Characteristic
Single-Agent + Tool Use	One reasoning loop with structured tool invocation	Suited to focused tasks with bounded scope
Supervisor / Router Agents	Central agent delegates tasks to specialised agents	Enables modularity and scalability
Multi-Agent Collaboration	Agents operate in parallel or sequence	Patterns: debate, critique, planning/execution split
Reflection / Critique Loops	Agents evaluate and refine their own outputs	Improves reliability over multiple iterations
Retrieval-Augmented Agents	External knowledge via vector search or APIs	Reduces hallucination and improves grounding
Event-Driven / Long-Running Agents	Persistent agents reacting to triggers over time	Requires memory, state, and orchestration
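
As one concrete instance of the patterns above, a reflection / critique loop can be sketched as a bounded generate-critique-revise cycle. The code is a hedged illustration: `reflect_loop`, `toy_draft`, and `toy_critique` are hypothetical, and the two toy functions stand in for real model calls.

```python
from typing import Callable, Optional


def reflect_loop(
    draft: Callable[[str, Optional[str]], str],
    critique: Callable[[str], Optional[str]],
    task: str,
    max_rounds: int = 3,
) -> tuple[str, int]:
    """Generate, critique, and revise until the critic has no feedback or the budget runs out."""
    feedback: Optional[str] = None
    output = ""
    for round_no in range(1, max_rounds + 1):
        output = draft(task, feedback)
        feedback = critique(output)
        if feedback is None:  # critic accepted the output
            return output, round_no
    return output, max_rounds  # budget exhausted: return the last attempt


# Stand-ins for model calls (assumptions for illustration only).
def toy_draft(task: str, feedback: Optional[str]) -> str:
    return task.upper() if feedback else task


def toy_critique(output: str) -> Optional[str]:
    return None if output.isupper() else "use upper case"


result, rounds = reflect_loop(toy_draft, toy_critique, "ship it")
print(result, rounds)
```

The explicit `max_rounds` budget is the important design choice: without it, a critic that never accepts turns the loop into an unbounded cost.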

🏗️ Reference Architectures

Audience: practitioners · Evidence class: mixed

Representative system designs for real-world use.

Architecture	Ecosystem Maturity	Description	Architectural Strengths	Operational Constraints	Workload Suitability	Design Paradigm	Governance Fit
DeerFlow	Emerging	Is: Open-source orchestration system combining sub-agents, memory, and sandboxes. Demonstrates: Workflow-oriented orchestration across agents with shared execution context.	Strong system-level reference for memory, sandbox, and skills composition.	Higher setup complexity and a heavier runtime surface than most teams need initially.	Strong fit for compound research/coding workflows and teams studying full-stack agent architectures. Poor fit for lightweight orchestration or narrowly scoped tasks.	Hierarchical multi-agent orchestration.	Requires explicit sandbox policy, tool boundaries, and operator oversight before untrusted code execution.
SWE-agent	Experimental	Is: Autonomous SWE system using a specialized Agent-Computer Interface (ACI). Demonstrates: Narrow action spaces and interface design tuned for code-repair tasks.	Streamlined command space, compressed history handling, and a clear task boundary for patch workflows.	Benchmark-oriented design, high token cost, and long end-to-end fix latency on larger tasks.	Strong fit for isolated PRs and self-contained bug fixes. Poor fit for broad refactors or environments without standard build tooling.	Single agent with a highly specialized action space (ACI).	Needs tight repository scoping, review gates, and execution controls to reduce silent code regressions.

📐 Spec-Driven Development

Audience: practitioners · Evidence class: mixed

Last reviewed: April 2026.

Agentic systems amplify whatever intent you feed them — including vague intent. Spec-driven development (SDD) treats the specification as the load-bearing artifact: a durable, reviewable document that describes what the system should do and how it should behave, from which plans, code, and tests are generated (and regenerated) by agents. It is the production-grade answer to “vibe coding.”

In an agentic context, the spec does three things at once:

Anchors intent — a typed, versioned contract the agent (and humans) refer back to across long sessions.
Defines the acceptance surface — plans, tasks, and tests are derived from the spec, not improvised per prompt.
Makes re-generation safe — regenerating code from an updated spec is cheaper and more reviewable than patching drift.

Core practices

Practice	What it means	Why it matters for agents
Spec before plan before code	Write a scoped spec (problem, constraints, acceptance criteria) before any plan or implementation. Plans and code are generated from the spec.	Agents behave better against a fixed target than against a shifting prompt.
Executable specs	Encode acceptance criteria as runnable checks (tests, evals, schema validators) alongside prose.	Lets agents self-verify and lets CI reject regressions without human review on every step.
Typed contracts at boundaries	Specify tool signatures, state shape, and I/O schemas with types (Pydantic, JSON Schema, TypeSpec).	Narrows the action space the agent can hallucinate into.
Review the spec, not the diff	Human review focuses on the spec and acceptance checks; the diff is a consequence.	Makes agent-authored PRs tractable at volume.
Versioned and diffable	Specs live in the repo, are PR-reviewed, and evolve with the code.	Gives rollback, blame, and audit trail — same hygiene as code.
One spec, many artifacts	Generate plans, tasks, tests, and docs from the same spec.	Keeps planner, actor, and verifier aligned.
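
The "executable specs" practice can be illustrated by pairing each acceptance criterion with a runnable check. This is a minimal sketch using only the standard library; the `Criterion` type, the invoice-extraction `SPEC`, and `failed_criteria` are hypothetical examples, not part of Spec Kit or any tool listed here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Criterion:
    """One acceptance criterion: prose for reviewers plus a runnable check."""
    description: str
    check: Callable[[dict], bool]


# Hypothetical spec for an invoice-extraction task.
SPEC = [
    Criterion(
        "total is a non-negative number",
        lambda out: isinstance(out.get("total"), (int, float)) and out["total"] >= 0,
    ),
    Criterion(
        "currency is a 3-letter code",
        lambda out: isinstance(out.get("currency"), str) and len(out["currency"]) == 3,
    ),
]


def failed_criteria(output: dict) -> list[str]:
    """Return the descriptions of every criterion the agent output violates."""
    return [c.description for c in SPEC if not c.check(output)]


print(failed_criteria({"total": 12.5, "currency": "EUR"}))
print(failed_criteria({"total": -1, "currency": "euro"}))
```

Because the failure messages are the spec's own prose, the same list can be fed back to the agent as revision feedback and shown to a human reviewer unchanged.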

Resources

Evidence tags follow the Benchmark and Evidence Policy.

Resource	Role	Description	Evidence
GitHub Spec Kit	Toolkit / methodology	Open-source toolkit for spec-driven development with agentic coding assistants (Copilot, Claude Code, Cursor, Gemini CLI). Defines the `/specify` → `/plan` → `/tasks` → `/implement` workflow used in this repo’s own `specs/` directory.	`[official]` repo · `[official]` announce
Kiro	IDE	AWS IDE built around spec-driven development: specs, steering files, and hooks drive agent work from requirements through tasks. First-party reference implementation of SDD in an IDE.	`[official]` docs · `[field report]` AWS launch post
OpenAI Model Spec	Behavioural spec	First-party example of treating model behaviour as a versioned, public spec — objectives, rules, defaults, and conflict resolution. A reference for how to write a spec an agent can actually be aligned to.	`[official]` spec · `[official]` post
AGENTS.md	Project-level agent spec	Simple convention for a repository-scoped file that instructs coding agents about build, test, style, and conventions. Widely supported across agent CLIs.	`[official]` site
Anthropic Claude Skills (SKILL.md)	Skill-level spec	Declarative, self-contained skill specs (`SKILL.md`) that package instructions, tools, and examples agents can discover and load on demand. Treats individual capabilities as versioned spec artifacts.	`[official]` docs
Pydantic AI	Typed contracts	Python framework that makes schema-first I/O the default for LLM calls — the practical form of “typed contracts at boundaries” for agent code.	`[official]` docs

Where SDD applies

SDD is not limited to new projects or a single team. The spec becomes a portable artifact in three directions:

Context	How SDD applies	Notes
Greenfield projects	Write the spec first; agents scaffold the repo, tests, and initial implementation from it.	Easiest case — no legacy constraints; the spec defines the system boundary.
Brownfield projects	Reverse-engineer specs from existing code and behaviour, then use them as the contract for future agent-authored changes.	Start narrow (one module or flow), treat the spec as the accepted behaviour, and expand coverage incrementally. Agents modify against the spec, not the full legacy codebase.
Shared across orgs	Specs, prompts, evals, and skill packs (`SKILL.md`, `AGENTS.md`, prompt libraries) are repo-level artifacts that can be open-sourced, forked, and re-used — like shared test suites or style guides.	Treat prompts and evals as first-class, versioned assets; publish them alongside code so research, patterns, and hard-won lessons compound across teams rather than staying trapped in one org.

How this repo uses SDD: the specs/ directory contains phased specs (requirements → plan → validation) that are generated with Spec Kit; implementation work is then executed against them. tasks/todo.md, the phase validation scripts, and PR bodies are derived artifacts. See CONTRIBUTING.md for the contributor-facing workflow.

🧠 Memory Systems

Audience: practitioners · Evidence class: mixed

Last reviewed: April 2026.

Memory is a first-class concern in agentic systems. Rather than treating memory as a simple array of previous messages, production systems require structured approaches to state, persistence, retrieval, and experience reuse. Four categories — working, episodic, procedural, semantic — remain the core architectural choices, but recent frontier research also shows memory is increasingly being used to improve future agent behaviour, not merely to store past context.

Memory Taxonomy

Different types of memory serve distinct functional roles in an agentic architecture:

Type	Definition	Implementation Examples
Working Memory (Thread State)	Short-term context for the current execution loop or active conversation thread. Ephemeral.	Context window, LangGraph `State`, in-memory message lists.
Episodic Memory	Autobiographical history of past actions, inputs, and outcomes. Enables reflection on past mistakes.	Checkpoint logs, event stores, prompt / trajectory histories.
Procedural Memory	Reusable skills, system prompts, and tool configurations. Defines how the agent operates.	Static configuration, retrieved skill libraries, GitHub workflows.
Semantic Memory	Embedded, factual knowledge about the world, the user, or the domain. Defines what the agent knows.	Vector databases (FAISS, Pinecone), knowledge graphs, Letta core memory.
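
To make the taxonomy concrete, the four memory types can be modelled as separate fields with different lifetimes. This is a deliberately simplified sketch: `AgentMemory` and `end_thread` are hypothetical names, and plain lists and dicts stand in for the stores named in the table.

```python
from dataclasses import dataclass, field


@dataclass
class AgentMemory:
    """Hypothetical container separating the four memory types by role."""
    working: list = field(default_factory=list)     # current thread messages (ephemeral)
    episodic: list = field(default_factory=list)    # past threads and their outcomes
    procedural: dict = field(default_factory=dict)  # skill name -> instructions
    semantic: dict = field(default_factory=dict)    # fact key -> fact

    def end_thread(self, outcome: str) -> None:
        """Archive the finished thread as an episode, then clear working memory."""
        self.episodic.append({"messages": list(self.working), "outcome": outcome})
        self.working.clear()


memory = AgentMemory(procedural={"triage": "Label the issue, then assign an owner."})
memory.working.extend(["user: build is failing", "agent: re-ran CI, flaky test found"])
memory.end_thread(outcome="resolved")
print(len(memory.working), len(memory.episodic))
```

Keeping the types in separate fields makes the lifecycle explicit: working memory is cleared per thread, while the other three persist and need their own write and retention rules.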

Frontier Research: Memory Beyond Storage

Recent research shows frontier agent systems moving beyond simple retrieval towards experience transformation: converting prior trajectories, workflows, and reasoning patterns into reusable guidance for future tasks. Memory becomes part of the learning loop, not just the context pipeline.

System	Best Fit in Taxonomy	Why it matters	Evidence
Agent Workflow Memory (AWM)	Episodic + Procedural	Induces reusable workflows from prior experience and selectively retrieves them to guide future generations; improves long-horizon web-agent tasks in both offline and online settings.	`[official]` repo · `[benchmark]` paper
Synapse	Episodic	Stores exemplar trajectories as memory and retrieves them via similarity search, using complete state–action histories (not shallow few-shot examples) to improve multi-step computer control.	`[official]` site · `[benchmark]` paper
ReasoningBank	Episodic + Procedural (semantic-adjacent)	Distils generalisable reasoning strategies from self-judged successful and failed experiences, then retrieves and updates them over time so the agent improves through continued interaction.	`[benchmark]` paper

Architectural Patterns: Shared vs. Private Memory

In multi-agent systems, memory boundaries are architectural decisions:

Private Agent Memory: Each agent maintains its own semantic and episodic stores. Prevents context leakage and maintains strong role boundaries.
Shared Workspace (Global Memory): A common blackboard or shared state where multiple agents read and write. Requires collision management and strict typing.
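
The "collision management and strict typing" requirement for a shared workspace can be sketched with typed keys and optimistic versioning. This is one possible approach, not the only one; `Blackboard` and `VersionConflict` are hypothetical names introduced for illustration.

```python
class VersionConflict(Exception):
    """Raised when a writer's view of a key is stale."""


class Blackboard:
    """Hypothetical shared workspace: typed keys plus optimistic concurrency."""

    def __init__(self, schema: dict):
        self._schema = schema  # key -> required Python type
        self._data = {}        # key -> (version, value)

    def read(self, key: str) -> tuple:
        """Return (version, value); unwritten keys are at version 0."""
        return self._data.get(key, (0, None))

    def write(self, key: str, value, expected_version: int) -> int:
        """Write only if the key is declared, the type matches, and no one wrote first."""
        if key not in self._schema:
            raise KeyError(f"undeclared key: {key}")
        if not isinstance(value, self._schema[key]):
            raise TypeError(f"{key} must be {self._schema[key].__name__}")
        version, _ = self.read(key)
        if version != expected_version:
            raise VersionConflict(f"{key} is at v{version}, writer expected v{expected_version}")
        self._data[key] = (version + 1, value)
        return version + 1


board = Blackboard({"plan": list})
seen_by_a, _ = board.read("plan")
seen_by_b, _ = board.read("plan")
board.write("plan", ["research", "draft"], expected_version=seen_by_a)
try:
    board.write("plan", ["draft only"], expected_version=seen_by_b)
except VersionConflict as err:
    print("agent B must re-read before writing:", err)
```

Rejecting stale writes forces the second agent to re-read and merge, which is usually safer than silently letting the last writer win.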

Retrieval and Persistence Decisions

Managing the memory lifecycle is critical for long-running agents.

Mechanism	Description	Best Practices & Risks
Checkpointing	Saving the exact thread state at a specific point in time (e.g., node transitions).	Enables “time travel” (rewind and replay) and human-in-the-loop approvals.
Write Policies	Rules defining when and how an agent commits data to long-term storage.	Prefer explicit `SaveMemory` tool calls over passive auto-saving to maintain control.
Retrieval Triggers	Determining when to query past memory (e.g., pre-fetch vs. just-in-time).	Use vector search for semantic recall, but use explicit graph keys for structured state.
Summarisation / Compression	Reducing token counts of episodic histories.	Summarise older interactions into a rolling summary while preserving recent exact messages.
Pruning / Decay	Deleting or archiving old or irrelevant memories.	Implement TTL (time-to-live) for working memory to prevent context exhaustion.
Contamination / Poisoning	Malicious or incorrect data persisting in long-term memory.	Risk: once poisoned, long-term memory feeds bad data into every later retrieval and decision. Require validation or bounds on semantic writes.
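
The summarisation row above (a rolling summary plus recent exact messages) can be sketched as follows. This is an illustrative sketch: `compact` is a hypothetical helper, and the `summarize` callable stands in for a model call.

```python
from typing import Callable


def compact(
    messages: list[str],
    keep_recent: int,
    summarize: Callable[[list[str]], str],
) -> list[str]:
    """Replace everything except the last `keep_recent` messages with one summary entry."""
    if keep_recent < 0:
        raise ValueError("keep_recent must be >= 0")
    if len(messages) <= keep_recent:
        return list(messages)  # nothing old enough to compress
    split = len(messages) - keep_recent
    older, recent = messages[:split], messages[split:]
    return [f"[summary] {summarize(older)}"] + recent


# Stand-in summariser (an assumption): a real system would call a model here.
def count_summary(older: list[str]) -> str:
    return f"{len(older)} earlier messages"


history = ["m1", "m2", "m3", "m4"]
print(compact(history, keep_recent=2, summarize=count_summary))
```

The recent messages are kept verbatim because summaries lose exact wording, and the latest turns are where exact wording most often matters.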

Systems and Protocols

Specialised infrastructure for managing agent memory.

System	Role	Description
LangGraph Persistence	Thread-level state	Built-in checkpointers (SQLite, Postgres) for graph-based execution loops, enabling interrupt/resume.
LangMem	Long-term memory extraction	LangChain’s framework for extracting user preferences and entity profiles in the background.
Letta (formerly MemGPT)	OS-level memory abstraction	Advanced core memory management with explicit paging (read/write limits) to mimic virtual memory.
Mem0	Personalized memory layer	Managed memory API focusing on user contexts, interactions, and entity relationships.
Zep / Graphiti	Enterprise memory & graphs	Fast, long-term memory for AI assistants; uses temporal knowledge graphs to map entity relationships over time.
MCP (Model Context Protocol)	Interoperability fabric	While not a DB itself, MCP provides a standard protocol to expose memory stores and file systems universally across tools and agents.

Design implication: the key question is no longer only what the agent remembers, but how memory changes future behaviour. Systems like Synapse, Agent Workflow Memory, and ReasoningBank signal a shift — memory is becoming part of the agent’s learning loop, enabling reusable routines and self-improvement over time.

📊 Formal Evaluation Rubric

Audience: maintainers · Evidence class: mixed

🎯 Evaluation principle: rubrics assess quality, test suites verify behaviour, assertions enforce invariants, and LLM-as-a-judge is used only in tightly scoped regression tests.

Every major framework and architecture in this repository is judged against the following Required Scoring Dimensions. We evaluate systems based on engineering rigor, not marketing copy.

Dimension	Evaluation Criteria
Control flow explicitness	How observable and deterministic is the execution path?
State model	How is agent state typed, managed, and persisted?
Memory support	Are there built-in primitives for short-term, episodic, and semantic memory?
Observability / tracing	Is it easy to trace intermediate reasoning steps and tool calls?
Human-in-the-loop support	Does it natively support interrupt-and-resume or approval gates?
Type safety / structured outputs	Are outputs guaranteed against strict schemas?
Provider portability	How tightly coupled is it to one specific LLM provider?
Security posture	Are there built-in mechanisms for sandboxing, access control, or guardrails?
Architectural strengths	Which design choices materially improve decomposition, control, state handling, or interface clarity?
Operational constraints	What deployment burden, runtime cost, debugging friction, or failure modes does it introduce?
Ecosystem maturity	How stable are the APIs, docs, integrations, and operator knowledge base?
Governance fit	Does it support auditability, approval gates, access boundaries, policy enforcement, and regulated environments?
Workload suitability	Which workflows, task shapes, and team contexts does it fit well or poorly?

Benchmark and Evidence Policy

Audience: maintainers · Evidence class: official

Canonical resources are trusted here because they define what counts as evidence. Prefer official docs, architecture guides, papers, benchmark repos, and first-party repositories when establishing capabilities, methodology, or interface details.

Evidence Tag	Use For
`[official]`	Official docs, architecture guides, specifications, benchmark documentation, or first-party repositories.
`[benchmark]`	Published benchmark runs, evaluation papers, or benchmark repos tied to a named workload.
`[field report]`	Production write-ups, incident reports, engineering blogs, or operator notes about real deployments.
`[author assessment]`	This repository’s synthesis after reviewing the sources above and applying the rubric.

Do not treat marketing copy, launch-day demos, or GitHub stars as sufficient evidence for production claims.
Separate benchmark performance from production maturity. A benchmark result can support workload fit, but it does not by itself prove reliability, governance fit, cost control, or operational maturity.
Record `Last reviewed: Month YYYY` in rapidly changing sections such as product lists, vendor capability summaries, and release-sensitive guidance.
See appendix/benchmark-and-evidence-policy.md for the full policy.
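
Parts of this policy lend themselves to a small lint step. The sketch below is an assumption about how such a check could look, not a description of this repository's validation scripts; `unknown_tags` and `has_valid_review_line` are hypothetical helpers built from the tag names and the `Last reviewed` format stated above.

```python
import re

# Tag names copied from the evidence table above.
ALLOWED_TAGS = {"official", "benchmark", "field report", "author assessment"}
TAG_PATTERN = re.compile(r"`\[([^\]]+)\]`")
REVIEWED_PATTERN = re.compile(
    r"^Last reviewed: (January|February|March|April|May|June|July|August|"
    r"September|October|November|December) \d{4}\.?$"
)


def unknown_tags(cell: str) -> list[str]:
    """Return evidence tags in a table cell that the policy does not define."""
    return [tag for tag in TAG_PATTERN.findall(cell) if tag not in ALLOWED_TAGS]


def has_valid_review_line(line: str) -> bool:
    """Check that a line follows the 'Last reviewed: Month YYYY' convention."""
    return bool(REVIEWED_PATTERN.match(line.strip()))


print(unknown_tags("`[official]` docs · `[vendor blog]` post"))
print(has_valid_review_line("Last reviewed: April 2026."))
```

A lint like this only checks that tags are well-formed; whether a source deserves its tag still needs human review.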

⚙️ Orchestration Frameworks

Audience: practitioners · Evidence class: mixed

Last reviewed: April 2026.

Deep Dives

Evidence tags follow the Benchmark and Evidence Policy. Scored against RUBRIC.md; cap of 5–8 deep-dive entries enforced.

Framework	Ecosystem Maturity	Description	Architectural Strengths	Operational Constraints	Workload Suitability	Design Paradigm	Governance Fit	Evidence
LangGraph	Production-ready	Is: Stateful orchestration framework building directed graphs with typed state. Demonstrates: Deterministic execution control mixed with LLM reasoning.	Explicit state management, persistence, and support for complex multi-actor workflows.	Verbose abstractions, steep learning curve, and graph sprawl if the workflow is over-modeled.	Strong fit for multi-step, stateful, and interruptible agent systems. Poor fit for simple single-prompt completions or linear chains.	Graph-based state machine (cycles allowed).	Good fit for auditable workflows and approval gates, but graph edges must be tightly constrained to avoid runaway loops.	`[official]` docs · `[field report]` LinkedIn SQL Bot
Microsoft Agent Framework	Production-ready	Is: Microsoft’s unified agent framework merging Semantic Kernel and AutoGen; first-class MCP and A2A support. Demonstrates: Enterprise-grade agent composition with typed plugins, approval workflows, and Azure integration.	Strong .NET + Python parity, typed function-calling, native MCP/A2A, and OpenTelemetry tracing.	Broader Azure coupling in the managed path; framework surface is still stabilizing post-merger.	Strong fit for enterprise teams already on Azure / Semantic Kernel and needing multi-language agents. Poor fit for teams wanting a minimal Python-only stack.	Typed plugin graph with pluggable orchestration (sequential, group chat, handoff).	Strong — supports approval gates, policy plugins, and audit logging out of the box.	`[official]` repo · `[official]` announce
AutoGen	Production-ready	Is: Microsoft Research multi-agent conversation framework; now an orchestration pattern inside Microsoft Agent Framework. Demonstrates: Conversable agents with group chat, code-executor, and human-proxy patterns.	Battle-tested multi-agent conversation patterns, large research footprint, flexible role composition.	Emergent conversation loops need explicit termination conditions; observability requires added tooling.	Strong fit for research on multi-agent collaboration and code-gen crews. Poor fit for strictly deterministic workflows.	Conversational multi-agent loop with configurable managers.	Needs explicit stop conditions and sandboxed code execution to be safe in production.	`[official]` v0.4 docs · `[benchmark]` AutoGen paper
OpenAI Agents SDK	Production-ready	Is: OpenAI’s official agents SDK with handoffs, guardrails, and sessions; successor path to Assistants API. Demonstrates: First-party multi-step agents with tool-use, tracing, and structured handoffs.	Tight integration with OpenAI tools, built-in tracing, ergonomic Python API, provider-agnostic via LiteLLM.	Primary optimization target is OpenAI models; porting to other providers loses some ergonomics.	Strong fit for teams shipping OpenAI-backed agents quickly with tracing. Poor fit for strict provider portability or local-only models.	Handoff-based multi-agent loop with sessions.	Viable for hosted approval flows; guardrails are first-class primitives.	`[official]` docs · `[official]` repo
CrewAI	Emerging	Is: Multi-agent collaboration framework where agents are assigned roles, goals, and tools. Demonstrates: Role-based agentic workflows with sequential and hierarchical processes.	Simple mental model and fast team-based decomposition for prototypes; growing enterprise feature set.	Less control for highly complex or non-standard systems; observability and typed state are weaker than LangGraph/MAF.	Strong fit for rapid prototyping of agent teams. Poor fit for deterministic execution, rigorous type safety, or custom orchestration loops.	Role-based sequential or hierarchical process execution.	Requires added guardrails and observability to manage emergent loops and inconsistent agent behaviour.	`[official]` docs · `[field report]` case studies
Pydantic AI	Production-ready	Is: Framework built directly on Pydantic enforcing strict data validation and type-safe outputs from LLMs. Demonstrates: Type-driven agentic execution and dependency injection.	Strong type-system integration, schema enforcement, dependency injection, and retry support.	Smaller surrounding ecosystem than older orchestration stacks; retry loops can increase latency and cost.	Strong fit for production systems needing strict type safety and predictable parsing. Poor fit for open-ended generative writing or weakly structured tasks.	Strongly typed, schema-first LLM interactions.	Good fit where schema validation and dependency control matter, but retry policies need explicit cost and failure bounds.	`[official]` docs
Smolagents	Emerging	Is: Minimalist framework built around `CodeAgent`, which writes its actions as Python code rather than JSON tool calls. Demonstrates: Code-first agent execution.	Lightweight core and direct execution model that stays close to Python control flow.	Weak typed-state enforcement and high exposure if generated code runs with broad permissions.	Strong fit for fast prototyping and Python-native experimentation. Poor fit for regulated networks or systems that need strict sandboxing and observability.	Python-native logic execution via LLM generation.	Requires strong sandboxing, network controls, and review boundaries before production use.	`[official]` docs
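
Several rows above mention typed state, approval gates, and runaway loops. The framework-independent sketch below shows how those three ideas fit together in a small graph runner. It does not use or imitate the API of any framework in the table; `State`, `write`, `review`, and `run` are hypothetical names, and the review node is a stand-in for a real human or automated gate.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class State:
    """Typed state passed between nodes."""
    draft: str = ""
    attempts: int = 0
    approved: bool = False
    trace: list = field(default_factory=list)


# A node mutates the state and returns the next node name, or None to stop.
Node = Callable[[State], Optional[str]]


def write(state: State) -> Optional[str]:
    state.attempts += 1
    state.draft = f"draft v{state.attempts}"
    return "review"


def review(state: State) -> Optional[str]:
    # Stand-in gate: approves from the second attempt onwards.
    state.approved = state.attempts >= 2
    return None if state.approved else "write"


def run(nodes: dict, start: str, state: State, max_steps: int = 10) -> State:
    """Walk the graph, recording a trace, and fail loudly if the step budget is exceeded."""
    current: Optional[str] = start
    for _ in range(max_steps):
        if current is None:
            return state
        state.trace.append(current)
        current = nodes[current](state)
    if current is not None:
        raise RuntimeError(f"step limit {max_steps} reached before node {current!r}")
    return state


final = run({"write": write, "review": review}, "write", State())
print(final.draft, final.trace)
```

The cycle between `write` and `review` is intentional; the step limit is what keeps a permitted cycle from becoming a runaway loop.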

Frameworks Landscape

Broader catalog beyond the deep-dive set. Each subsection capped at 8 entries; entries that cannot clear the rubric were removed in this phase (see PR body for cut list).

General Purpose

Framework	Lang	Description	Evidence
LangChain	Py/JS	Modular framework with chains, tools, memory, and broad integration coverage.	`[official]`
LangGraph	Py/JS	Graph-based orchestration. Stateful typed-state graphs with checkpointing.	`[official]`
LlamaIndex	Py/JS	Data-centric framework for retrieval-heavy and RAG-oriented agent systems.	`[official]`
Haystack	Py	Pipeline-based framework for search, retrieval, and hybrid agent workflows.	`[official]`
Semantic Kernel	C#/Py/Java	Microsoft enterprise kernel; now a composable layer inside Microsoft Agent Framework.	`[official]`
Microsoft Agent Framework	Py/.NET	Microsoft’s unified agent framework merging Semantic Kernel and AutoGen; first-class MCP and A2A support.	`[official]`
Pydantic AI	Py	Type-safe, Pydantic-native; schema-first LLM interactions with dependency injection.	`[official]`
DSPy	Py	Stanford. Programming not prompting; compiler optimizes prompts against metrics.	`[official]` · `[benchmark]`

Multi-Agent Orchestration

Framework	Lang	Description	Evidence
AutoGen	Py	Microsoft Research multi-agent conversations; v0.4 redesigned for async event-driven execution.	`[official]` · `[benchmark]`
CrewAI	Py	Role-based crew members with goals, tools, and sequential/hierarchical processes.	`[official]`
OpenAI Agents SDK	Py	Official OpenAI multi-step agents with handoffs, guardrails, sessions, and tracing.	`[official]`
Google ADK	Py	Native Gemini multi-agent orchestration; deploys to Vertex AI Agent Engine.	`[official]`
MetaGPT	Py	PM / architect / engineer roles simulating a software company; research-oriented.	`[official]` · `[benchmark]`
CAMEL	Py	Role-based simulation and collaborative reasoning research framework.	`[official]` · `[benchmark]`
DeerFlow	Py	ByteDance orchestration system for planning, tools, memory, and execution.	`[official]`
AgentScope	Py	Alibaba multi-agent framework with message-passing runtime and distributed mode.	`[official]`

Lightweight / Minimalist

Framework	Lang	Description	Evidence
Smolagents	Py	Hugging Face minimal agents (~1000 lines); code-action agents with sandboxed execution.	`[official]`
Agno	Py	Lightweight, model-agnostic agent framework with native multi-modal support.	`[official]`
Upsonic	Py	MCP-first framework with minimal setup and typed task graphs.	`[official]`
Portia AI	Py	Plan-based agent framework aimed at reliable production deployments with approval gates.	`[official]`
Mastra	TS	TypeScript-first framework with observability, workflows, and memory.	`[official]`

📡 Protocols and Standards

Audience: practitioners · Evidence class: official

Last reviewed: April 2026.

Protocols are the stable contracts between agents, tools, and hosts. Each entry below distinguishes the specification from any specific implementation; conflating the two is a recurring anti-pattern (see ANTI-PATTERNS.md).

Protocol	Kind	Description	Evidence
MCP (Model Context Protocol)	Open spec	Anthropic-authored open standard for exposing tools, resources, prompts, and sampling to LLM hosts; wide multi-vendor adoption in 2025–2026.	`[official]` spec
A2A (Agent2Agent)	Open spec	Google-originated, Linux Foundation–hosted protocol for secure cross-agent communication across vendors and frameworks.	`[official]` spec
OpenAI Function / Tool Calling	Vendor API	Native structured tool invocation for OpenAI models; JSON-schema-typed tool definitions.	`[official]`
Anthropic Tool Use	Vendor API	Native structured tool invocation for Claude models; supports parallel tool calls and computer-use tools.	`[official]`
OpenAPI	Open spec	Industry-standard HTTP API specification; foundation for typed, discoverable tool surfaces behind MCP or direct function-calling.	`[official]`
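
The "JSON-schema-typed tool definitions" mentioned above can be illustrated with a tool description and a check on the arguments a model proposes. This is a hedged sketch: `weather_tool` and `validate_args` are hypothetical, the exact wrapper fields differ between vendors and should be taken from their documentation, and the validator covers only a small subset of JSON Schema (required fields, basic types, enums, and `additionalProperties: false`).

```python
import json

# Hypothetical tool definition: name, description, and JSON Schema parameters.
weather_tool = {
    "name": "get_weather",
    "description": "Look up current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "units": {"type": "string", "enum": ["metric", "imperial"]},
        },
        "required": ["city"],
        "additionalProperties": False,
    },
}

_JSON_TYPES = {"string": str, "number": (int, float), "integer": int,
               "boolean": bool, "object": dict, "array": list}


def validate_args(schema: dict, args: dict) -> list[str]:
    """Check model-proposed arguments against a small subset of JSON Schema."""
    errors = []
    props = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in args:
            errors.append(f"missing required argument: {name}")
    for name, value in args.items():
        if name not in props:
            if schema.get("additionalProperties") is False:
                errors.append(f"unexpected argument: {name}")
            continue
        spec = props[name]
        expected = _JSON_TYPES.get(spec.get("type"))
        if expected is not None and not isinstance(value, expected):
            errors.append(f"{name}: expected {spec['type']}")
        elif "enum" in spec and value not in spec["enum"]:
            errors.append(f"{name}: must be one of {spec['enum']}")
    return errors


# Tool arguments commonly arrive as a JSON string, so parse before validating.
proposed = json.loads('{"units": "kelvin", "zip": "0150"}')
print(validate_args(weather_tool["parameters"], proposed))
```

In production a full JSON Schema validator is a better choice than a hand-rolled subset; the point here is that arguments are validated before the tool runs, not after.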

🛂 Agent Authority, Identity & Delegation

Audience: practitioners, AI security, enterprise architects · Evidence class: mixed

Last reviewed: April 2026.

Protocols like MCP, A2A, and function calling explain how agents connect to tools and systems. This section covers who or what is authorised to use those connections, and how that authority is governed in production — what the industry has begun to call the AI Agent Authority Gap.

Action-taking agents do not emerge with independent authority. They are triggered, invoked, provisioned, or empowered by existing enterprise identities — human users, service accounts, bots, machine identities, OAuth tokens, API keys, tool connectors, MCP servers, CLIs, scripts, and automation infrastructure. The same agent loop can reach into source control, ticketing, messaging, cloud APIs, and private data through credentials that were never designed for autonomous use.

Agents are not just a new identity type. They are a delegated identity type. Their authority originates from traditional enterprise actors — humans, bots, service accounts, and machine identities — and is inseparable from the posture of those delegators.

Traditional IAM was built to answer a narrow question: who has access? Once agents are introduced, the operative question shifts:

What authority is being delegated, by whom, under what conditions, for what purpose, and across what scope?

Identity dark matter is the term for authority that exists, operates, and accumulates risk outside the view of managed IAM — fragmented human and machine identities, embedded credentials, unmanaged service accounts, bot accounts, and application-specific identity logic. If that dark matter remains unobserved, agents inherit an already-broken authority model and become efficient amplifiers of hidden access, hidden permissions, and hidden execution paths.

The practical consequence is sequencing: an enterprise cannot safely govern agentic AI unless it first observes and governs the traditional actors that serve as its delegation source. Closing the authority gap is therefore a delegation problem first, and an agent problem second.

Core Concepts

Concept	Definition	Why it matters
Agent Identity	A first-class, distinguishable identity for the agent itself, separate from the human, service account, or bot whose authority it borrows.	Without it, every agent action is attributed to its delegator or a shared service account, breaking attribution, audit, and incident response.
Delegation Source	The human, service account, bot, or machine identity that triggers, invokes, or empowers the agent.	Agent authority is inseparable from delegator posture; a poorly governed delegator yields a poorly governed agent.
Delegation Chain	The end-to-end path delegator → agent → tool → target application → action, including any intermediate scopes, tokens, and approvals.	Makes the path of authority explicit and reviewable instead of inherited from whatever credentials the runtime happens to hold.
Identity Dark Matter	Authority — identities, credentials, tokens, service accounts, embedded secrets, application-specific access paths — that exists and operates outside the view of managed IAM.	Agents amplify hidden authority into automated action at machine speed; un-illuminated dark matter becomes the agent’s effective permission set.
Authority Boundary	The explicit set of applications, workflows, scopes, and actions an agent is allowed to touch.	Defines the blast radius. An agent without a stated boundary effectively has the union of every credential its delegator can reach.
Dynamic Sequential Delegation Control	Runtime authority decisions that combine the posture of the delegator, the context of the target application, the intent behind the requested action, and the scope of execution.	Static, long-lived permissions do not match agents whose delegators, tasks, and risk levels change per run.
Continuous Observability	A live feed of identity behaviour across managed and unmanaged environments, used as input to authority decisions rather than as after-the-fact reporting.	Turns observability into governance: the same telemetry that illuminates dark matter becomes the decision input for what the agent is allowed to do next.
Human Approval Gate	A required human checkpoint before sensitive or irreversible actions are executed.	Keeps high-impact decisions reviewable; separates recommend from execute on the act / recommend / constrain / stop continuum.
Audit Trail	An immutable record linking delegator → agent identity → delegated authority → tool call → target application → decision point → result.	Required for incident response, compliance, and learning loops. Without it, agent failures are unreproducible and dark matter stays dark.
Revocation Path	A defined, tested way to disable an agent, rotate its credentials, and invalidate its delegated authority — including upstream delegator credentials when needed.	When an agent or its delegator is compromised, you need a working off switch, not a code change.
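
The concepts in the table above can be made concrete with a small data model. The following is a minimal sketch, not a reference implementation: the class names (`DelegationChain`, `AuthorityBoundary`, `AuditTrail`), the `authorize` helper, and the example identities are all hypothetical, and the in-memory list stands in for the tamper-evident storage a real audit trail would need.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class DelegationChain:
    """Path of authority: delegator -> agent -> tool -> target application -> action."""
    delegator: str           # human, service account, bot, or machine identity
    agent_id: str            # the agent's own identity, distinct from the delegator
    tool: str
    target_application: str
    action: str


@dataclass(frozen=True)
class AuthorityBoundary:
    """Explicit (application, action) pairs the agent is allowed to touch."""
    allowed: frozenset

    def permits(self, chain: DelegationChain) -> bool:
        return (chain.target_application, chain.action) in self.allowed


@dataclass
class AuditTrail:
    """Append-only in-memory log (assumption: production needs tamper-evident storage)."""
    records: list = field(default_factory=list)

    def record(self, chain: DelegationChain, decision: str) -> dict:
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "delegator": chain.delegator,
            "agent_id": chain.agent_id,
            "tool": chain.tool,
            "target_application": chain.target_application,
            "action": chain.action,
            "decision": decision,
        }
        self.records.append(entry)
        return entry


def authorize(chain: DelegationChain, boundary: AuthorityBoundary, trail: AuditTrail) -> bool:
    """Check the chain against the boundary and log the decision either way."""
    allowed = boundary.permits(chain)
    trail.record(chain, "allow" if allowed else "deny")
    return allowed


# Hypothetical usage: the agent may comment on tickets, nothing else.
boundary = AuthorityBoundary(frozenset({("ticketing", "comment")}))
trail = AuditTrail()
chain = DelegationChain("alice@example.com", "triage-agent-1", "ticket_tool", "ticketing", "comment")
print(authorize(chain, boundary, trail))
```

Denied requests are logged as well as allowed ones, so the trail captures attempts to exceed authority and not only successful actions.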

Design Principles

Treat agents as delegated actors, not autonomous islands; their authority is bounded by the posture of their delegation source.
Govern the delegation source first. Reduce identity dark matter across human and machine identities before granting agents broad action rights.
Map the delegation chain explicitly: delegator (human / service account / bot / machine identity) → agent → tool → target application → action.
Use least privilege, scoped tokens, short-lived credentials, and tested revocation paths at every link in the chain.
Govern authority continuously on posture, context, intent, and scope — not only on nominal permissions issued at provisioning time.
Operate agents on an act / recommend / constrain / stop continuum: not every request should result in execution; some should be downgraded to recommendation, restricted to a limited tool set, or blocked.
Separate recommend, prepare, approve, and execute as distinct steps in the agent loop, and require human approval gates for sensitive or irreversible actions (writes to production, financial transactions, external messages, code merges).
Log the delegator, agent identity, delegated authority, tool call, target application, decision point, and result for every action — not just the final answer.
Evaluate whether agents attempt to exceed their authority or amplify dark matter, not only whether they complete tasks.
Do not confuse model guardrails with identity governance. Production systems need both: prompt-level safety and continuous, IAM-level delegation control.
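
The act / recommend / constrain / stop continuum in the principles above can be sketched as a single decision function evaluated per request. This is an illustrative policy under stated assumptions, not a standard: the `Request` fields, the ordering of the checks, and the 15-minute credential lifetime (`MAX_CREDENTIAL_AGE_S`) are hypothetical choices made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ACT = "act"              # execute the action
    RECOMMEND = "recommend"  # downgrade to a recommendation for a human
    CONSTRAIN = "constrain"  # allow only a limited tool set
    STOP = "stop"            # block the request


@dataclass(frozen=True)
class Request:
    delegator_governed: bool  # posture: is the delegation source under managed IAM?
    in_boundary: bool         # scope: is the action inside the stated authority boundary?
    sensitive: bool           # context/intent: irreversible or high-impact action?
    human_approved: bool      # has a human approval gate been passed?
    credential_age_s: int     # age of the delegated credential in seconds


MAX_CREDENTIAL_AGE_S = 900  # hypothetical 15-minute lifetime for short-lived credentials


def decide(req: Request) -> Decision:
    """Illustrative ordering: hard blocks first, then downgrades, then execution."""
    if not req.in_boundary or req.credential_age_s > MAX_CREDENTIAL_AGE_S:
        return Decision.STOP
    if req.sensitive and not req.human_approved:
        return Decision.RECOMMEND
    if not req.delegator_governed:
        return Decision.CONSTRAIN
    return Decision.ACT


# Hypothetical usage: a production write that no human has approved yet.
print(decide(Request(True, True, True, False, 60)))
```

Because the function runs on every request, authority follows current posture, context, intent, and scope instead of whatever was granted at provisioning time.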

Resources

Resource	Focus	Why it matters
Bridging the AI Agent Authority Gap	Authority gap, delegation source, identity dark matter, continuous observability as a decision engine	Frames AI agents as a delegated identity type whose authority originates from humans, bots, service accounts, and machine identities, and argues for governing the delegation source before the agent.
OWASP GenAI Security Project	GenAI and agent security controls	Useful governance and threat-modelling reference for agentic systems, including excessive agency and tool-use risks.
Model Context Protocol	Tool and context protocol	Tool access expands the authority surface of agentic systems; MCP is a primary vector for that expansion and where scoping must be applied.
Auth0 AI Agents / Token Vault	Scoped delegated access for AI agents	Practical identity pattern for managing agent access to user-authorised systems with short-lived, scoped tokens.

🧭 Reasoning & Planning Models

Audience: researchers · Evidence class: benchmark

Last reviewed: April 2026.

This section lists models that reason or plan explicitly at inference time: chain-of-thought baked into the decoding loop, extended thinking budgets, or trained planner heads. They change the shape of agent loops: the model absorbs work that used to live in a planner node, which shifts where you spend tokens, latency, and trust. The list is capped at 5–8 entries, selected for agentic relevance rather than general benchmark wins. Same-family tiers (e.g. mini / nano, Sonnet / Haiku, Flash / Flash-Lite) are grouped into one row because they share the same reasoning interface and differ mainly in latency and cost.

Model	Provider	Reasoning Mode	Why it matters for agents	Evidence
GPT-5.4 (family: `gpt-5.4` / `mini` / `nano`)	OpenAI	Tunable reasoning effort (`none`/`low`/`medium`/`high`/`xhigh`) with native computer-use, web/file search, and function tools	Single family covers planner (`gpt-5.4`), subagent/actor (`mini`), and high-volume tool-calling (`nano`) — lets one agent loop span tiers without swapping SDKs.	`[official]` model index · `[official]` `gpt-5.4` · `mini` · `nano`
Claude Opus 4.7 / Sonnet 4.6 / Haiku 4.5	Anthropic	Configurable extended-thinking budget, interleaved with tool calls	Same thinking-budget dial across a planner/worker/fast-actor trio; agent-friendly latency/cost staging without changing prompt contract.	`[official]` system cards · `[official]` extended thinking
Gemini 3.1 Pro / 3 Flash / 3.1 Flash-Lite	Google DeepMind	Deep Think parallel-hypothesis reasoning (Pro); thinking-capable Flash tier for cheaper loops	Three-tier reasoning ladder over Google’s long-context stack — Pro for planning, Flash/Flash-Lite for high-fanout tool calls in the same agent graph.	`[official]` model cards · `[official]` Gemini 3.1 Pro card
DeepSeek-R1	DeepSeek	RL-trained reasoning traces, open weights	First strong open-weight reasoning model; reproducible baseline for planner research and local agent stacks.	`[official]` repo · `[benchmark]` paper
Qwen3 (thinking mode)	Alibaba	Switchable thinking / non-thinking modes	Open-weight family with explicit thinking toggle — useful when you want the same model in both planner and actor roles.	`[official]` repo · `[benchmark]` tech report
Grok 4	xAI	Native reasoning with tool use	Aggressive frontier-reasoning entrant; useful as a diversity source in multi-model planner ensembles.	`[official]` page

Decision guide: if your agent loop already does explicit plan → act → verify steps, a reasoning model can often replace the planner node — but it rarely removes the need for typed state, tracing, and eval. Treat reasoning as a cheaper planner, not a free reliability upgrade.
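
To illustrate why typed state and tracing survive the swap, here is a minimal sketch of a plan → act → verify loop in which the planner is just a callable. The `LoopState` type, the `run_loop` function, and the stub planner, actor, and verifier are all hypothetical; in practice the planner callable might wrap a reasoning model, but no specific provider API is assumed here.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class LoopState:
    """Typed state carried through the loop, independent of which planner is used."""
    goal: str
    plan: List[str] = field(default_factory=list)
    results: List[str] = field(default_factory=list)
    trace: List[Tuple[str, str]] = field(default_factory=list)  # (phase, detail)
    verified: bool = False


def run_loop(
    goal: str,
    planner: Callable[[str], List[str]],
    actor: Callable[[str], str],
    verifier: Callable[[LoopState], bool],
) -> LoopState:
    state = LoopState(goal=goal)
    state.plan = planner(goal)
    state.trace.append(("plan", "; ".join(state.plan)))
    for step in state.plan:
        output = actor(step)
        state.results.append(output)
        state.trace.append(("act", output))
    state.verified = verifier(state)
    state.trace.append(("verify", str(state.verified)))
    return state


# Stub components standing in for a model-backed planner and real tools.
def stub_planner(goal: str) -> List[str]:
    return [f"search: {goal}", f"summarise: {goal}"]


def stub_actor(step: str) -> str:
    return f"done({step})"


def stub_verifier(state: LoopState) -> bool:
    return len(state.results) == len(state.plan)


final = run_loop("find open incidents", stub_planner, stub_actor, stub_verifier)
print(final.verified, [phase for phase, _ in final.trace])
```

Replacing `stub_planner` with a model call changes one argument; the state type, the trace, and the verifier stay in place, which is the point of the decision guide.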

🧪 Evaluation & Safety

Audience: researchers · Evidence class: benchmark

Last reviewed: April 2026.

This section covers frameworks and operational tooling for testing agent quality, correctness, task completion, regressions, and system behaviour, as well as security scanning, red teaming, policy testing, and misalignment research. Evidence tags follow the Benchmark and Evidence Policy.

Core Evaluation Areas

Output correctness
Reasoning quality
Tool-use accuracy
Latency and cost
Robustness under adversarial input

Evaluation Frameworks

Framework	Description	Methodology / Workload Suitability	Evidence
OpenAI Evals	Core framework for testing and improving AI systems.	Foundational evaluation framework and methodology.	`[official]`
DeepEval	Open-source LLM evaluation framework with metrics for hallucination, answer relevance, and task completion.	Application-level evaluation and regression testing.	`[official]`
promptfoo	CLI and library for evaluation and red teaming of LLM apps.	Regression testing, prompt/application evals, adversarial testing.	`[official]`
Inspect	UK AI Security Institute’s framework for rigorous LLM evals covering coding, reasoning, agent behavior, and model-graded scoring.	Rigorous research-grade and agent-task evaluation.	`[official]` · `[benchmark]`
Azure AI Evaluation SDK	Azure Foundry evaluation SDK with built-in agent, safety, and quality evaluators.	Enterprise agent evaluation tied to Foundry tracing.	`[official]`

Key Practices

Golden datasets
Regression testing
Adversarial / red-team inputs
Continuous evaluation pipelines
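
As a rough illustration of how golden datasets and regression testing fit together, the sketch below scores an agent against a fixed set of cases and flags a regression when the pass rate drops below a stored baseline. The names (`GoldenCase`, `run_regression`), the exact-match comparison, and the stub agent are assumptions made for the example; real suites usually need task-specific or model-graded scoring.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class GoldenCase:
    case_id: str
    prompt: str
    expected: str


def _normalise(text: str) -> str:
    return text.strip().lower()


def run_regression(
    agent: Callable[[str], str],
    cases: List[GoldenCase],
    baseline_pass_rate: float,
) -> Dict[str, object]:
    """Run every golden case and compare the pass rate with the baseline."""
    if not cases:
        raise ValueError("golden dataset is empty")
    failures = [
        case.case_id
        for case in cases
        if _normalise(agent(case.prompt)) != _normalise(case.expected)
    ]
    pass_rate = (len(cases) - len(failures)) / len(cases)
    return {
        "pass_rate": pass_rate,
        "failures": failures,
        "regressed": pass_rate < baseline_pass_rate,
    }


# Hypothetical golden set and a stub agent that gets one case wrong.
GOLDEN = [
    GoldenCase("g1", "capital of France?", "Paris"),
    GoldenCase("g2", "2 + 2?", "4"),
    GoldenCase("g3", "opposite of hot?", "cold"),
    GoldenCase("g4", "first month?", "January"),
]
ANSWERS = {"capital of France?": "paris", "2 + 2?": "4", "opposite of hot?": "warm", "first month?": "January "}


def stub_agent(prompt: str) -> str:
    return ANSWERS[prompt]


report = run_regression(stub_agent, GOLDEN, baseline_pass_rate=0.9)
print(report)
```

Running the same function in CI on every prompt, model, or tool change is one simple form of a continuous evaluation pipeline.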

Tracing and Monitoring

Tool	Description
Langfuse	OSS LLM observability. Traces, evals, prompts.
LangSmith	LangChain platform. Tracing, testing, evaluation.
Braintrust	Eval-driven development. Experiment tracking.
Arize Phoenix	OSS AI observability. Traces, evals, embeddings.
Helicone	OSS LLM observability. One-line integration.
Weights & Biases Weave	Trace and evaluate LLM apps.

Benchmarks

Benchmark	Description	Evidence
SWE-bench	Coding-agent benchmark grounded in real GitHub issues and patches; `Verified` subset is the canonical agent workload.	`[official]` · `[benchmark]`
AgentBench	8-environment LLM agent benchmark covering OS, DB, web, and game tasks.	`[official]` · `[benchmark]`
Terminal-Bench	Evaluates terminal-agent execution on shell-based tasks with scored task completions.	`[official]` · `[benchmark]`
GAIA	General AI assistant benchmark with real-world multi-step tasks and tool use.	`[official]` · `[benchmark]`
WebArena / VisualWebArena	Web agent benchmark on real-website snapshots; visual variant tests multimodal web agents.	`[official]` · `[benchmark]`
τ-bench	Tool-use + user-simulation benchmark measuring agent reliability and consistency across trials.	`[official]` · `[benchmark]`
OSWorld	Computer-use benchmark for multimodal agents on real desktop OS tasks across Ubuntu/Windows/macOS; complements web-only benchmarks.	`[official]` · `[benchmark]` paper
LiveCodeBench	Contamination-resistant coding benchmark with time-stamped problems from LeetCode/AtCoder/Codeforces; complements SWE-bench’s repo-issue workload.	`[official]` · `[benchmark]` paper
WebVoyager	Web-agent benchmark on live production websites (not snapshots); tests multimodal browsing under real network and UI drift conditions.	`[official]` · `[benchmark]` paper
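
Consistency across trials, as measured by τ-bench, can be quantified with a pass^k style metric: the probability that an agent solves a task in all of k independent trials. The sketch below assumes the estimator described in the τ-bench paper, C(c, k) / C(n, k) averaged over tasks, where c is the number of successful trials out of n for a task; the function name and the sample data are hypothetical, and results should be checked against the benchmark's own implementation before being reported.

```python
from math import comb
from typing import List


def pass_hat_k(successes_per_task: List[int], n_trials: int, k: int) -> float:
    """Estimate the chance that all k trials of a task succeed, averaged over tasks.

    successes_per_task[i] is the number of successful trials (out of n_trials)
    observed for task i.
    """
    if not 1 <= k <= n_trials:
        raise ValueError("k must satisfy 1 <= k <= n_trials")
    if not successes_per_task:
        raise ValueError("at least one task is required")
    if any(c < 0 or c > n_trials for c in successes_per_task):
        raise ValueError("success counts must lie in [0, n_trials]")
    denominator = comb(n_trials, k)
    # comb(c, k) is 0 when c < k, so tasks with too few successes contribute nothing.
    per_task = [comb(c, k) / denominator for c in successes_per_task]
    return sum(per_task) / len(per_task)


# Hypothetical data: three tasks, four trials each, with 4, 2, and 0 successes.
counts = [4, 2, 0]
for k in (1, 2, 4):
    print(k, round(pass_hat_k(counts, n_trials=4, k=k), 4))
```

With k = 1 the metric reduces to the ordinary average success rate; as k grows it penalises agents that succeed only intermittently, which is why it is a useful complement to single-run accuracy.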

Safety Risk Surfaces & Mitigations

⚠️ Core Risk Surfaces	🛡️ Mitigation Strategies
Prompt injection (direct & indirect)	Input validation and filtering
Tool misuse	Tool permissioning and sandboxing
Data exfiltration	Human-in-the-loop approval gates
Memory poisoning	Audit logs and traceability
Unbounded autonomous behaviour	Policy-driven execution
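
Two of the mitigations above, tool permissioning and bounding autonomous behaviour, can be combined in a thin wrapper around the tool registry. The following is a minimal sketch under stated assumptions: `GuardedToolbox`, `ToolPolicyError`, and the example tools are hypothetical names, and the wrapper enforces policy only. It does not sandbox the tools themselves.

```python
class ToolPolicyError(Exception):
    """Raised when a tool call violates the allowlist or the call budget."""


class GuardedToolbox:
    """Wraps a tool registry with an allowlist, a call budget, and an audit log."""

    def __init__(self, tools, allowed, max_calls):
        self._tools = dict(tools)
        self._allowed = set(allowed)
        self._max_calls = max_calls
        self.calls_made = 0
        self.audit_log = []  # (tool name, outcome) tuples, in call order

    def call(self, name, *args, **kwargs):
        # Permission check first: unknown or disallowed tools never run.
        if name not in self._allowed or name not in self._tools:
            self.audit_log.append((name, "denied"))
            raise ToolPolicyError(f"tool {name!r} is not permitted")
        # Budget check second: caps how many actions one run can take.
        if self.calls_made >= self._max_calls:
            self.audit_log.append((name, "budget_exhausted"))
            raise ToolPolicyError("call budget exhausted")
        self.calls_made += 1
        result = self._tools[name](*args, **kwargs)
        self.audit_log.append((name, "allowed"))
        return result


# Hypothetical usage: the agent may read files but not delete them.
toolbox = GuardedToolbox(
    tools={"read": lambda path: f"contents of {path}", "delete": lambda path: None},
    allowed={"read"},
    max_calls=2,
)
print(toolbox.call("read", "notes.txt"))
```

Denied and over-budget calls are logged before the exception is raised, so the audit log records what the agent attempted as well as what it was allowed to do.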

Safety Tooling & Methodologies

Resource	Description	Workload Suitability	Official Link
garak	LLM vulnerability scanner probing for hallucination, leakage, injection, toxicity, and jailbreaks.	Automated red teaming & vulnerability scanning	GitHub
OWASP GenAI Security Project	Governance and mitigation framework for safety risks in LLMs and agentic systems.	Governance, controls, and secure-design reference	Project Home
Anthropic Alignment Stress-Testing	Research and operational approach for deliberately stress-testing alignment evals and oversight.	Research-driven safety evaluation methodology	Post
Model Organisms of Misalignment	In-vitro demonstrations of alignment failures so they can be studied empirically.	Advanced safety research and methodology	Post
AI Safety via Debate	Alignment framework for cases where direct human supervision is too hard.	Alignment and scalable oversight resource	Paper
Concrete Problems in AI Safety	Foundational framing paper for safety problems (side effects, reward hacking, safe exploration, distributional shift).	Foundational safety resource	Paper
Anthropic Agentic Misalignment	Grounds safety concerns in concrete behaviours (blackmail, espionage) in simulated settings.	Applied safety & threat-modelling reference	Research Post

AI Guardrails

Tool	Description
Guardrails AI	Structural, type, quality guarantees for LLM outputs.
NeMo Guardrails	NVIDIA. Programmable conversation guardrails.
LLM Guard	Security toolkit. Input/output scanning.
Rebuff	Prompt injection detection.
Lakera Guard	Real-time protection. Prompt injection, data leakage, toxicity.

🧠 Skills and Operating Principles

Audience: practitioners · Evidence class: field report

Building agentic systems requires a shift in skillset:

Problem decomposition
System design and orchestration
Tool and interface design
Memory modelling
Evaluation design
Failure mode analysis
Safety and governance thinking

🚫 What NOT to Do

Audience: all contributors · Evidence class: mixed

To keep this repository genuinely opinionated, we advocate against these common anti-patterns:

Do not begin with multi-agent systems when a single agent plus tools will do. Escalate to multi-agent only when task decomposition requires it.
Do not add memory before defining what deserves persistence. Avoid “state bloat” by being intentional about what is stored and why.
Do not treat tracing as optional for long-running systems. Observability is the only way to debug non-deterministic agentic failures.
Do not confuse benchmark wins with production readiness. Real-world reliability requires evaluation on your specific data and edge cases.
Do not use framework abstractions as a substitute for architecture. Understand your control flow before outsourcing it to a library.
Do not give agents inherited or ambient authority without mapping the delegation chain. Every action-taking agent should have a clear identity, scoped permissions, observable tool calls, approval gates where needed, and a revocation path. See Agent Authority, Identity & Delegation.

📊 Signals (How to Read This List)

Audience: all contributors · Evidence class: mixed

⭐ Production-grade
🧪 Experimental
⚠️ Early-stage / unstable
🛂 Authority-aware — explicitly models identity, delegated permissions, approval gates, and auditability

🚀 Getting Started

Audience: practitioners · Evidence class: mixed

Choose a core pattern (e.g. single-agent + tools)
Add structured tool use
Introduce evaluation early
Layer in memory only when needed
Expand into multi-agent systems with clear roles
Add observability and safety constraints

🤝 Contributing

Audience: all contributors · Evidence class: mixed

Contributions are welcome! Please read CONTRIBUTING.md for full details before submitting a pull request.

At a high level, submissions should include:

Clear description of purpose
Architectural strengths and operational constraints
Governance fit and workload suitability
Evidence of ecosystem maturity or real-world usage (preferred)
Evidence tags and Last reviewed markers where claims are time-sensitive or likely to change

This is a curated list, not an exhaustive one.

See appendix/benchmark-and-evidence-policy.md for the sourcing, evidence-tagging, and Last reviewed policy.

📌 Final Note

Audience: all contributors · Evidence class: mixed

The shift to agentic systems is not about more tools.

It is about:

Designing systems that can reason, act, evaluate, and improve
Ensuring those systems are reliable, observable, and safe

Build accordingly.
