A position paper published on arXiv challenges the dominant paradigm in embodied intelligence research. The authors, a team from Motoniq and collaborators, argue that scaling up Vision-Language-Action (VLA) models and world models is not, on its own, enough to achieve general-purpose robot intelligence.
The paper identifies four critical components missing from current approaches: a Physical Data Engine with Embodied Autolabelling, Cross-Embodiment Task-Preserving Retargeting, Physics-Grounded World Models, and Self-Improving Deployment Loops.
According to the researchers, three limitations stand out: current robots still rely heavily on pre-organized training data, video supervision does not translate directly into robot-executable actions, and existing world models often fail to preserve critical physical variables such as contact, force, and material response.
The authors suggest that the path forward requires building a physical data engine that unifies heterogeneous data sources into a common underlying physical structure, enabling robots to learn beyond demonstration data.
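To make the idea of a common physical structure concrete, here is a minimal sketch in Python. It is an illustration only and is not taken from the paper: the `PhysicalSample` schema, the converter functions, the input field names (`wrist_force_n`, `touching`, and so on), and the contact threshold are all hypothetical. It shows how two heterogeneous sources, a teleoperation log and an annotated video frame, might be mapped onto one record that keeps the contact, force, and material variables the paper says are often lost.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical threshold: forces above this count as contact.
CONTACT_THRESHOLD_N = 0.5


@dataclass
class PhysicalSample:
    """Hypothetical shared record; fields a source cannot observe stay None."""
    source: str
    timestamp_s: float
    in_contact: Optional[bool] = None
    contact_force_n: Optional[float] = None
    material: Optional[str] = None


def from_teleop(record: dict) -> PhysicalSample:
    """Map an assumed teleoperation log entry (has force, no material label)."""
    force = record["wrist_force_n"]
    return PhysicalSample(
        source="teleop",
        timestamp_s=record["t"],
        in_contact=force > CONTACT_THRESHOLD_N,
        contact_force_n=force,
    )


def from_video(annotation: dict) -> PhysicalSample:
    """Map an assumed video annotation (has contact/material labels, no force)."""
    return PhysicalSample(
        source="video",
        timestamp_s=annotation["frame"] / annotation["fps"],
        in_contact=annotation.get("touching"),
        material=annotation.get("material"),
    )


samples = [
    from_teleop({"t": 0.5, "wrist_force_n": 2.0}),
    from_video({"frame": 30, "fps": 30.0, "touching": True, "material": "fabric"}),
]
```

In this sketch, unobserved quantities are left as `None` rather than guessed, so downstream code can tell which physical variables each source actually provides.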
