Robotics Foundation Models

Robotics foundation models — increasingly framed as vision-language-action (VLA) models — are large pretrained policies that map perception and language directly to robot actions. They aim to be generalist across tasks, scenes, and in some cases embodiments, in the same way that LLMs are generalist across text tasks.

From an engineering standpoint, these models are reshaping the stack: classical perception–planning–control pipelines are being replaced or augmented by a single learned policy, with fine-tuning and prompting as the primary integration surface. That shift moves a lot of risk into data quality, evaluation methodology, and runtime safety, because the model's competence boundary is no longer something you can read off a state machine.
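
A common mitigation for the runtime-safety part of that risk is to keep a thin, hand-written guard between the learned policy and the actuators. Below is a minimal sketch, assuming a hypothetical 7-DoF arm commanded in joint-velocity space; the specific limit values and the shape of the action are illustrative assumptions, not any particular model's interface:

```python
import numpy as np

# Illustrative per-joint velocity limits (rad/s) for a hypothetical 7-DoF arm;
# real limits come from the robot's spec sheet, not from the policy.
VEL_LIMITS = np.array([1.0, 1.0, 1.0, 1.2, 1.5, 1.5, 2.0])
MAX_STEP = 0.05  # assumed max change per control tick, to damp sudden jumps


def guard(action: np.ndarray, prev_action: np.ndarray) -> np.ndarray:
    """Clamp a policy's raw action to hard velocity limits and a rate-of-change bound."""
    action = np.clip(action, -VEL_LIMITS, VEL_LIMITS)       # enforce joint-velocity limits
    action = np.clip(action,                                 # bound per-tick change
                     prev_action - MAX_STEP,
                     prev_action + MAX_STEP)
    return action
```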

When choosing between options, weigh embodiment coverage vs. your target robot, openness (weights, training data, fine-tuning recipes), action-space conventions (continuous control, discretised tokens, flow matching), and inference cost at the control frequency you need. Open models with reproducible recipes are usually a better starting point than closed APIs for any system you intend to deploy and evaluate rigorously.
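
The action-space convention shapes how you integrate a model. RT-style models, for instance, discretise each continuous action dimension into a fixed number of uniform bins so actions can be emitted as tokens. A minimal sketch of that round trip follows; the 256-bin count and the [-1, 1] normalised range are assumptions for illustration, not a specific model's configuration:

```python
import numpy as np

N_BINS = 256           # assumed bin count; RT-style models commonly use 256
LOW, HIGH = -1.0, 1.0  # assumed normalised action range


def discretise(action: np.ndarray) -> np.ndarray:
    """Map continuous actions in [LOW, HIGH] to integer token ids in [0, N_BINS - 1]."""
    clipped = np.clip(action, LOW, HIGH)
    return np.round((clipped - LOW) / (HIGH - LOW) * (N_BINS - 1)).astype(np.int64)


def undiscretise(tokens: np.ndarray) -> np.ndarray:
    """Invert the mapping: token ids back to (quantised) continuous actions."""
    return tokens.astype(np.float64) / (N_BINS - 1) * (HIGH - LOW) + LOW
```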

Start here

OpenVLA is the most accessible entry point: open weights, open training recipe, and a 7B-parameter VLA built on a strong vision-language backbone.
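
For a sense of the integration surface, OpenVLA is distributed through Hugging Face transformers. The sketch below follows the project's published quick-start pattern, but treat the exact identifiers (the `openvla/openvla-7b` model id, the `predict_action` helper, and the `unnorm_key` value) as assumptions to verify against the current model card:

```python
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

# Model id and the remote-code predict_action / unnorm_key interface follow the
# OpenVLA quick start; check the model card before relying on them.
MODEL_ID = "openvla/openvla-7b"

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda:0")

image = Image.open("observation.png")  # current camera frame
prompt = "In: What action should the robot take to pick up the red block?\nOut:"

inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
# Returns a 7-D end-effector action (delta pose + gripper), un-normalised with
# statistics keyed by the training mixture (here the BridgeData V2 key).
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
```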

  • π0 (Physical Intelligence) — Generalist policy combining multi-robot data with flow matching for dexterous manipulation.
  • Octo — Open-source generalist robot policy trained on Open X-Embodiment with cross-embodiment fine-tuning.
  • OpenVLA — Open-source 7B-parameter vision-language-action model built on Prismatic VLMs.
  • RT-2 — Vision-language-action model that transfers web knowledge to robotic control.
  • RT-X — Cross-embodiment models demonstrating positive transfer across robot platforms.
  • Gemini Robotics — Google DeepMind VLA family with embodied reasoning capabilities.
  • GR00T N1 (NVIDIA) — Open humanoid foundation model with a dual-system slow/fast architecture.
  • Helix (Figure) — Vision-language-action model targeting generalist humanoid control.
  • RT-1 — Robotics Transformer for large-scale real-robot manipulation with language-conditioned control.
  • PaLM-E — Embodied multimodal language model integrating visual and robot-state observations for action.
  • SayCan — Language-model-guided skill selection framework for grounded robot task execution.
  • Code as Policies — Program-synthesis approach that compiles language instructions into executable robot policies.
  • VIMA — Promptable transformer for multimodal robot manipulation via in-context generalization.
  • Gato — Generalist policy architecture spanning embodied control and non-robotic tasks with tokenized actions.
  • RoboFlamingo — Open vision-language-action model for low-cost adaptation to robot manipulation tasks.