Safety & Robustness

This category covers the tools, benchmarks, and methodology used to constrain policy behaviour, stress-test it under adversarial or out-of-distribution conditions, and characterise its failure modes. It includes constrained RL, safe-exploration environments, formal verification, and the evaluation protocols used to gate deployment.

From an engineering and deployment standpoint, this is the category that determines whether a system is shippable. Capability metrics describe what a policy can do on a good day; safety and robustness work describes what it does on the worst day, in the long tail, and under correlated failures. For physical systems, that distinction is regulatory, contractual, and sometimes life-critical — not optional.

When choosing tools, separate training-time safety (constrained RL, safe exploration) from evaluation-time safety (adversarial scenarios, stress tests, formal verification) and apply both. Match the failure-mode taxonomy to your deployment context — collisions, force limits, drift, hallucinated actions — and pair quantitative robustness metrics with explicit human-in-the-loop fallback paths.
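To make the training-time side concrete, here is a minimal sketch of the Lagrangian approach common in constrained RL: a dual variable grows when average episode cost exceeds a limit, and the policy optimises a reward penalised by that variable. The functions, learning rate, and cost numbers are illustrative assumptions, not taken from any particular library.

```python
# Hedged sketch of a Lagrangian dual update for constrained RL.
# All names and numbers here are illustrative assumptions.

def lagrangian_update(lam, episode_cost, cost_limit, lr=0.05):
    """Move the Lagrange multiplier toward satisfying E[cost] <= cost_limit."""
    # Gradient ascent on the dual: lambda grows while the constraint is violated.
    lam = lam + lr * (episode_cost - cost_limit)
    return max(0.0, lam)  # the multiplier must stay non-negative


def penalised_reward(reward, cost, lam):
    """The shaped reward the policy actually optimises."""
    return reward - lam * cost


lam = 0.0
cost_limit = 25.0
# Toy per-episode costs: violations early, compliance later.
for episode_cost in [40.0, 35.0, 30.0, 20.0]:
    lam = lagrangian_update(lam, episode_cost, cost_limit)
# lam has risen while cost > limit and relaxed once under it.
```

The design point is that the trade-off between reward and safety is learned rather than hand-tuned: the multiplier automatically prices constraint violations into the objective.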

Start here

Safety Gym (OpenAI) is a clean starting point for hands-on work on constrained and safe-exploration RL, with reference environments and baselines that map directly to the literature.
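The evaluation-time side can be sketched as a harness that rolls out a policy and gates on cumulative safety cost per episode. The stub environment below stands in for a Safety Gym task (which reports a per-step safety cost alongside the reward); the environment, policy, and budget are illustrative assumptions, not the library's actual API.

```python
# Hedged sketch: a stress-test harness gating deployment on a cost budget.
# StubEnv is a toy stand-in, not Safety Gym itself.
import random


class StubEnv:
    """Toy env emitting (obs, reward, done, info) with a per-step safety cost."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return 0.0

    def step(self, action):
        self.t += 1
        # Rare unsafe event (e.g. entering a hazard region), cost 1 per step.
        cost = 1.0 if self.rng.random() < 0.1 else 0.0
        done = self.t >= 100
        return 0.0, 1.0, done, {"cost": cost}


def evaluate(env, policy, episodes=20, cost_budget=5.0):
    """Return per-episode costs and whether every episode stayed in budget."""
    costs = []
    for _ in range(episodes):
        obs, done, total_cost = env.reset(), False, 0.0
        while not done:
            obs, reward, done, info = env.step(policy(obs))
            total_cost += info["cost"]
        costs.append(total_cost)
    return costs, all(c <= cost_budget for c in costs)


costs, shippable = evaluate(StubEnv(), policy=lambda obs: 0)
```

Reporting the full per-episode cost distribution, rather than only the mean, is what exposes the long-tail behaviour the section above calls out.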