VLA model
Also known as: vision-language-action model
In brief
A vision-language-action (VLA) model is a single neural network that takes camera images plus a natural-language instruction and outputs motor commands. VLAs replace the older split between separate perception, planning, and control stacks with one end-to-end policy.
In a humanoid context, a VLA model takes the robot's current camera feeds (vision) plus a goal expressed in language ("put the blue mug in the dishwasher") and emits low-level motor commands or end-effector trajectories. The whole pipeline is a single forward pass through a transformer-style network, rather than a hand-off between the separately engineered perception, planning, and control modules of the older robotics stack.
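The sketch below illustrates that interface only: pixels and instruction tokens go in, one forward pass through a shared transformer comes out as per-joint motor commands. It is a minimal toy, not any vendor's actual architecture; the class name, layer sizes, patch size, joint count, and the use of raw integer token ids are all illustrative assumptions, and real systems use pretrained vision and language backbones with much larger action decoders.

```python
# Toy VLA policy sketch (assumptions throughout): one RGB camera at 96x96,
# a pre-tokenized instruction, and 23 joints. Not a production architecture.
import torch
import torch.nn as nn

class ToyVLAPolicy(nn.Module):
    def __init__(self, vocab_size=1000, d_model=128, num_joints=23):
        super().__init__()
        # Vision encoder: patchify the camera image into tokens.
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        # Language encoder: embed instruction token ids.
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # Shared transformer over the concatenated vision + language tokens.
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        # Action head: pooled representation -> one command per joint.
        self.action_head = nn.Linear(d_model, num_joints)

    def forward(self, image, instruction_ids):
        # image: (B, 3, 96, 96); instruction_ids: (B, T) integer token ids
        vis = self.patch_embed(image).flatten(2).transpose(1, 2)  # (B, 36, d)
        txt = self.text_embed(instruction_ids)                    # (B, T, d)
        tokens = torch.cat([vis, txt], dim=1)   # one fused token sequence
        fused = self.backbone(tokens)
        # Pool and decode: the whole perceive->decide->act step is this pass.
        return self.action_head(fused.mean(dim=1))  # (B, num_joints)

policy = ToyVLAPolicy()
image = torch.rand(1, 3, 96, 96)              # current camera frame
instruction = torch.randint(0, 1000, (1, 8))  # "put the blue mug ..." as ids
motor_command = policy(image, instruction)    # (1, 23) joint commands
```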
VLA models are the primary research direction in 2025–2026 for general-purpose humanoid control. Figure's Helix and the AI architecture on Tesla's Optimus are the most public examples; NVIDIA's GR00T is a foundation-model base that vendors fine-tune. The advantage is generalization: a single trained policy can handle many tasks without per-task engineering. The disadvantage is data hunger: VLAs need huge amounts of demonstration or simulation data.