
A May 17, 2026 review paper from researchers at Fudan University, the Shanghai Innovation Institute and the National University of Singapore defines a unified class called World Action Models (WAMs) and organizes roughly one hundred recent papers under it. The authors argue that WAMs let robotic systems simulate the consequences of candidate actions before executing them, closing a core gap in robotics AI between forecasting and control. This matters because it enables training controllers from unlabeled video collections that are far larger than traditional action-labeled robot datasets.
The survey divides WAM research into two main architectural families. Cascaded WAMs first predict a next frame or video and then derive control commands from that prediction; early examples include UniPi. Some cascaded systems extract motion fields or compute trajectories from geometric flows, while others predict compressed, abstract future representations to cut compute costs. Joint WAMs produce future frames and actions together inside a single model or in tightly coupled streams. The paper cites architectures such as GR-1, GR-2 and WorldVLA, alongside diffusion-based variants like PAD, UWM and DreamZero, which generate perceptual forecasts and control signals in parallel rather than in separate stages.
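The difference between the two families can be illustrated with a deliberately tiny sketch. Everything in it is an assumption made for illustration: the "frame" is a single number (a 1-D position), and the function names `predict_next_obs`, `inverse_dynamics`, `cascaded_wam` and `joint_wam` are hypothetical, not taken from UniPi, GR-1 or any other cited system.

```python
from typing import Tuple

Obs = float     # toy stand-in for an image frame: a 1-D position
Action = float  # toy stand-in for a control command: a displacement


def predict_next_obs(obs: Obs, goal: Obs, max_step: float = 1.0) -> Obs:
    """Hypothetical 'video predictor': imagine the next frame, one step closer to the goal."""
    delta = max(-max_step, min(max_step, goal - obs))
    return obs + delta


def inverse_dynamics(obs: Obs, next_obs: Obs) -> Action:
    """Hypothetical inverse-dynamics step: recover the action linking two frames."""
    return next_obs - obs


def cascaded_wam(obs: Obs, goal: Obs) -> Tuple[Obs, Action]:
    """Cascaded: forecast first, then derive the action from the forecast."""
    imagined = predict_next_obs(obs, goal)
    action = inverse_dynamics(obs, imagined)
    return imagined, action


def joint_wam(obs: Obs, goal: Obs, max_step: float = 1.0) -> Tuple[Obs, Action]:
    """Joint: a single function emits the forecast and the action together."""
    action = max(-max_step, min(max_step, goal - obs))
    return obs + action, action


print(cascaded_wam(0.0, 3.0))
print(joint_wam(0.0, 3.0))
```

In this toy setting both variants agree on the result; the point is only the shape of the computation. The cascaded version has a separable forecasting stage that could in principle be trained on video alone, while the joint version produces both outputs in one pass.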
The authors contrast WAMs with existing vision-language-action (VLA) models and with standalone video generators. Conventional VLA systems typically map observations directly to actions without modelling how the environment will change, while video generators produce plausible future frames but are not tied to control. The review notes recent work by a team at Peking University that formalizes a unified definition of world models separating forecasting from control coupling; WAMs are presented as satisfying both capabilities.
A central practical payoff the paper highlights is generalization: by simulating how scenes evolve after candidate moves, WAMs can better handle unfamiliar objects and novel settings. Critically, WAMs can be trained on internet-scale, unlabeled footage, such as everyday videos of people. Traditional robot controllers, which require action-labeled footage, largely cannot use this class of data.
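The idea of simulating candidate moves before acting can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a method from the review: `choose_action`, `toy_world_model` and `closeness_to_goal` are hypothetical names, and the world model is a trivial 1-D function rather than a learned video predictor.

```python
from typing import Callable, Sequence


def choose_action(
    state: float,
    candidates: Sequence[float],
    world_model: Callable[[float, float], float],
    score: Callable[[float], float],
) -> float:
    """Imagine the outcome of each candidate action and return the best-scoring one."""
    return max(candidates, key=lambda action: score(world_model(state, action)))


def toy_world_model(position: float, action: float) -> float:
    """Assumed toy dynamics: the action is simply added to the position."""
    return position + action


GOAL = 2.0


def closeness_to_goal(predicted_position: float) -> float:
    """Higher is better: negative distance between the imagined outcome and the goal."""
    return -abs(GOAL - predicted_position)


best = choose_action(0.0, [-1.0, 0.0, 1.0], toy_world_model, closeness_to_goal)
print(best)
```

Nothing is executed on the "robot" until the imagined outcomes have been compared, which is the property the authors attribute to WAMs.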
The review preserves technical distinctions useful to practitioners. Some cascaded systems compute motion fields (examples include AVDC and 3DFlowAction) that permit geometric trajectory extraction, while other approaches (VPP, LAPA) skip pixel rendering and predict compact state representations to reduce compute. Joint approaches range from tokenized sequences that intermix images and actions to diffusion-based pipelines; NVIDIA’s Cosmos Policy and DreamDojo are cited as dual-use architectures that can function as controller, simulator or evaluator.
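As a rough illustration of what "geometric trajectory extraction" from a motion field can mean, the sketch below accumulates the predicted per-frame displacement of one tracked point into a list of waypoints. It is an assumption-laden toy, not the procedure used by AVDC or 3DFlowAction: `trajectory_from_flow` is a hypothetical helper, and the flow values are made up for the example.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def trajectory_from_flow(start: Point, flows: List[Point]) -> List[Point]:
    """Accumulate per-frame (dx, dy) displacements of one tracked point into waypoints."""
    x, y = start
    waypoints = [start]
    for dx, dy in flows:
        x, y = x + dx, y + dy
        waypoints.append((x, y))
    return waypoints


# Made-up flow: the point moves right by 1, then up by 2.
path = trajectory_from_flow((0.0, 0.0), [(1.0, 0.0), (0.0, 2.0)])
print(path)
```

A controller could then track these waypoints directly, without the system ever needing action labels for the video the flow was predicted from.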
Positioning WAMs as a research roadmap rather than a single technique, the authors map work branching since 2024 and foreground concrete trade-offs for builders: rendered pixels versus abstract states, cascaded versus joint inference, and compute costs versus control fidelity. They frame immediate next steps as exploring these architectural choices and scaling training data to test WAMs’ generalization in real-world robot deployments.