Manipulation

Manipulation covers the methods, models, and tooling used by robots to grasp, place, assemble, and otherwise interact with objects under contact. It spans analytic grasp planning, learned visuomotor policies, bimanual and dexterous control, and the data-collection rigs (teleoperation, demonstrations) that feed them.

From an engineering standpoint, manipulation is where Physical AI meets the hardest parts of the real world: contact dynamics, occlusion, deformables, and long-horizon dependencies between sub-skills. It is the category where simulation gaps bite hardest, where data quality dominates model choice, and where small changes in gripper geometry, controller stiffness, or camera placement routinely outweigh algorithmic differences.

When choosing between methods, match the task structure (single-skill vs. long-horizon, single-arm vs. bimanual, rigid vs. deformable) to the policy class (analytic grasp planner, behavior cloning, diffusion policy, VLA) and the data regime you can realistically support. Diffusion- and transformer-based imitation policies have become strong defaults; analytic grasping remains the right answer when geometry is well-known and contact is simple.
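The selection logic above can be sketched as a small decision helper. This is purely illustrative: the function name, inputs, and the demo-count threshold are assumptions, not part of any library or benchmark, and real method selection depends on many more factors (gripper, sensing, controller) than these flags capture.

```python
def choose_policy_class(known_geometry: bool, simple_contact: bool,
                        long_horizon: bool, num_demos: int) -> str:
    """Hypothetical helper mapping task structure + data regime to a policy class.

    The 100-demo threshold is an illustrative placeholder, not an
    empirical cutoff.
    """
    if known_geometry and simple_contact:
        # Well-characterized geometry and simple contact: analytic grasping
        # (e.g. Dex-Net-style pipelines) is usually sufficient.
        return "analytic grasp planner"
    if long_horizon and num_demos < 100:
        # Long-horizon tasks with few demos favor data-efficient,
        # language-conditioned architectures (PerAct-style).
        return "language-conditioned transformer"
    if num_demos >= 100:
        # With a reasonable demonstration budget, diffusion-based imitation
        # is a strong default for contact-rich visuomotor tasks.
        return "diffusion policy"
    # Otherwise start with the simplest imitation baseline and collect data.
    return "behavior cloning baseline"


print(choose_policy_class(True, True, False, 10))     # analytic grasp planner
print(choose_policy_class(False, False, False, 300))  # diffusion policy
```

The point is not the thresholds but the ordering of questions: geometry and contact complexity first, then horizon, then data budget.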

Start here

Diffusion Policy is the modern baseline for visuomotor manipulation: simple to set up, strong on contact-rich tasks, and the policy class against which most recent results are compared.
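The core idea of Diffusion Policy is to sample a chunk of future actions by iteratively denoising random noise, conditioned on the current observation, rather than regressing actions directly. The sketch below shows a generic DDPM-style reverse loop over a scalar action chunk; the `noise_model` stub stands in for the learned noise-prediction network, and the schedule, step count, and all names are illustrative assumptions, not the reference implementation.

```python
import math
import random

T = 10  # diffusion steps (real implementations use many more, often with DDIM-style samplers)
BETAS = [0.01 * (t + 1) for t in range(T)]  # simple linear noise schedule (illustrative)
ALPHAS = [1.0 - b for b in BETAS]
ALPHA_BAR = []
_prod = 1.0
for _a in ALPHAS:
    _prod *= _a
    ALPHA_BAR.append(_prod)  # cumulative product of alphas


def noise_model(noisy_actions, obs, t):
    """Stand-in for the learned noise-prediction network (hypothetical).

    A trained model would predict the noise added at step t, conditioned on
    the observation; returning zeros keeps this sketch runnable.
    """
    return [0.0 for _ in noisy_actions]


def sample_action_chunk(obs, horizon=8):
    """DDPM-style reverse process over a chunk of `horizon` scalar actions."""
    # Start the chunk from pure Gaussian noise.
    x = [random.gauss(0.0, 1.0) for _ in range(horizon)]
    for t in reversed(range(T)):
        eps = noise_model(x, obs, t)
        a_t, ab_t = ALPHAS[t], ALPHA_BAR[t]
        # Standard DDPM posterior-mean update for each action dimension.
        x = [
            (xi - (1.0 - a_t) / math.sqrt(1.0 - ab_t) * ei) / math.sqrt(a_t)
            for xi, ei in zip(x, eps)
        ]
        if t > 0:  # inject noise at every step except the last
            x = [xi + math.sqrt(BETAS[t]) * random.gauss(0.0, 1.0) for xi in x]
    return x  # in practice: execute the first few actions, then replan


chunk = sample_action_chunk(obs={"image": None}, horizon=8)
print(len(chunk))  # 8
```

At deployment the policy runs in a receding-horizon loop: sample a chunk, execute a prefix of it, re-observe, and sample again, which is what makes the approach robust on contact-rich tasks.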

  • Diffusion Policy — Visuomotor policy learning via action diffusion; widely used baseline for imitation.
  • ACT (Action Chunking Transformers) — Transformer policy for bimanual fine manipulation from demonstrations.
  • Mobile ALOHA — Bimanual mobile manipulation system with low-cost teleoperation hardware.
  • ALOHA Unleashed — Recipe for scaling robot dexterity via large-scale imitation learning.
  • Dex-Net — Datasets and models for analytic and learned robust grasping.
  • Contact-GraspNet — Grasp pose generation directly from partial point clouds.
  • RoboCasa — Large-scale household-scene simulation suite for training generalist manipulation policies.
  • MIT 6.4210 — Robotic Manipulation — Russ Tedrake's reference text covering perception, planning, and control for manipulation.
  • PerAct — 3D voxel-action transformer for language-conditioned, long-horizon manipulation.
  • CLIPort — Language-conditioned manipulation with CLIP-based perception and transport-based action heads.
  • Transporter Networks — Keypoint-based pick-and-place architecture for data-efficient tabletop manipulation.
  • GraspNet-1Billion — Large-scale benchmark and dataset for robust 6-DoF grasp planning.
  • AnyGrasp — Efficient 6-DoF grasp generation framework for real-time deployment.
  • 3D Diffusion Policy (DP3) — Point-cloud-conditioned diffusion policy that improves data efficiency and robustness over image-based variants.
  • robosuite — Modular simulation framework for manipulation research with reproducible task environments.