Skip to content

Memory poisoning

Malicious input corrupts agent memory, leading to unsafe or unintended actions.

An attacker manipulates input during one session so a poisoned fact is committed to long-term memory; in a later, unrelated session the agent retrieves it as trusted state and lets it shape decisions the attacker is no longer present to make.

  • [*] --> UntrustedWrite : Manipulated — input arrives
  • UntrustedWrite --> StoredAsTrusted : No classification — at write
  • StoredAsTrusted --> RetrievedAsState : Future task — reads memory
  • RetrievedAsState --> InfluencesAction : Treated as — agent fact
  • InfluencesAction --> [*]



Defence classifies and tags every write with provenance and expiry, blocks instruction-shaped content from entering memory, and re-checks freshness and trust on every read so that no untrusted fact silently becomes durable agent state.

  1. Memory storeWrite-side — controls
  2. Memory storeRead-side — controls
  3. Write-side — controlsClassification — before write
  4. Write-side — controlsProvenance and — expiry tags
  5. Write-side — controlsAnomaly detection on — instruction-shaped content
  6. Read-side — controlsFreshness — filter
  7. Read-side — controlsInstruction-data — separation in memory
  8. Read-side — controlsLogging of — influential reads