VLA model

Also known as: vision-language-action model

In brief

A vision-language-action (VLA) model is a single neural network that takes camera images plus a natural-language instruction and outputs motor commands. VLAs replace the older split between separate perception, planning, and control stacks with one end-to-end policy.

In a humanoid context, a VLA model takes the robot's current camera feeds (vision) plus a goal expressed in language ("put the blue mug in the dishwasher") and emits low-level motor commands or end-effector trajectories. The whole pipeline is a single forward pass through a transformer-style network, distinct from the older robotics stack where perception, planning, and control were separately engineered modules.
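The single-forward-pass pipeline above can be sketched as a minimal, runnable stub. All names, dimensions, and the random "weights" here are illustrative assumptions, not any real model; production VLAs such as Helix or GR00T use large pretrained transformers, but the interface — image plus instruction in, one action vector out — is the point.

```python
# Hypothetical sketch of a VLA policy interface. Random matrices stand in
# for pretrained transformer weights; only the data flow is representative.
import numpy as np

rng = np.random.default_rng(0)

D = 64          # shared embedding width (assumed)
N_ACTIONS = 7   # e.g. 6-DoF end-effector delta + gripper (assumed)

# Stand-in "weights": a real model would load pretrained parameters.
W_img = rng.standard_normal((3 * 8 * 8, D)) * 0.02   # image patch -> embedding
W_txt = rng.standard_normal((256, D)) * 0.02         # byte token -> embedding
W_out = rng.standard_normal((D, N_ACTIONS)) * 0.02   # pooled state -> action

def embed_image(image: np.ndarray) -> np.ndarray:
    """Split an HxWx3 image into 8x8 patches and project each to D dims."""
    H, W, _ = image.shape
    patches = [
        image[y:y + 8, x:x + 8].reshape(-1) @ W_img
        for y in range(0, H - 7, 8)
        for x in range(0, W - 7, 8)
    ]
    return np.stack(patches)

def embed_text(instruction: str) -> np.ndarray:
    """Byte-level 'tokenizer': one embedding row per UTF-8 byte."""
    return W_txt[np.frombuffer(instruction.encode(), dtype=np.uint8)]

def vla_policy(image: np.ndarray, instruction: str) -> np.ndarray:
    """One forward pass: vision + language tokens -> one action vector."""
    tokens = np.concatenate([embed_image(image), embed_text(instruction)])
    pooled = np.tanh(tokens).mean(axis=0)   # stand-in for transformer layers
    return pooled @ W_out                   # low-level motor command

action = vla_policy(rng.random((64, 64, 3)),
                    "put the blue mug in the dishwasher")
print(action.shape)  # (7,)
```

Note that both modalities land in one token sequence before any action is computed; that fusion inside a single network is what distinguishes a VLA from a perception module handing labels to a separate planner.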

VLA models are the primary research direction in 2025–2026 for general-purpose humanoid control. Figure's Helix and the AI stack in Tesla's Optimus are the most public examples; NVIDIA's GR00T is a foundation-model base that vendors fine-tune for their own hardware. The main advantage is generalization: a single trained policy can handle many tasks without per-task engineering. The main disadvantage is data hunger: VLAs need very large amounts of teleoperated demonstration or simulation data to train.
