Research | June 9, 2026 | Embodied Global Team

VoLo: A Physical Orchestrator for Open-Vocabulary Long-Horizon Manipulation

VoLoAgent is a VLM-based physical orchestrator that treats robot capabilities as interruptible tools, enabling robots to perform complex open-vocabulary manipulation tasks with real-time planning and recovery.

Researchers have developed VoLoAgent, an orchestrator built on a Vision-Language Model (VLM) that treats heterogeneous robot capabilities as interruptible tools for open-vocabulary long-horizon manipulation tasks.

Unlike virtual AI agents, a robot acts in a physical world that does not pause for reasoning, so the timing of decisions, actions, and tool calls is critical. VoLoAgent addresses this by planning, monitoring, and recovering in real time, treating a VLA (Vision-Language-Action) model as an interruptible tool it can steer mid-rollout.
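
To make the idea of an interruptible tool concrete, here is a minimal Python sketch of an orchestration loop that checks a monitor between control steps and redirects a running skill instead of waiting for it to finish. It is an illustration under stated assumptions, not VoLoAgent's implementation: `InterruptibleSkill`, `orchestrate`, `demo_monitor`, the step counts, and the instructions are all hypothetical stand-ins for the VLA policy and the VLM's monitoring.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class InterruptibleSkill:
    """Hypothetical stand-in for a VLA policy that advances one control step at a time."""
    instruction: str
    steps_needed: int
    steps_done: int = 0

    def step(self) -> None:
        # One control step of the rollout (a real policy would emit an action here).
        self.steps_done += 1

    @property
    def finished(self) -> bool:
        return self.steps_done >= self.steps_needed

    def steer(self, new_instruction: str, steps_needed: int) -> None:
        # Redirect the rollout in place rather than waiting for it to end.
        self.instruction = new_instruction
        self.steps_needed = steps_needed
        self.steps_done = 0


# A monitor returns a corrective instruction, or None if the rollout looks fine.
Monitor = Callable[[int, InterruptibleSkill], Optional[str]]


def orchestrate(skill: InterruptibleSkill, monitor: Monitor, max_steps: int = 50) -> List[str]:
    """Run the skill step by step, letting the monitor interrupt between steps."""
    log: List[str] = []
    for t in range(max_steps):
        if skill.finished:
            log.append(f"t={t}: done '{skill.instruction}'")
            break
        correction = monitor(t, skill)
        if correction is not None:
            log.append(f"t={t}: interrupt -> '{correction}'")
            skill.steer(correction, steps_needed=3)  # assumed recovery length
            continue
        skill.step()
    return log


def demo_monitor(t: int, skill: InterruptibleSkill) -> Optional[str]:
    # Assumption for the demo: the monitor notices a slipped grasp at step 2.
    if t == 2 and skill.instruction == "pick up the red cup":
        return "re-grasp the red cup"
    return None


skill = InterruptibleSkill("pick up the red cup", steps_needed=5)
log = orchestrate(skill, demo_monitor)
for line in log:
    print(line)
```

The point of the design is that the monitor runs inside the rollout loop, so a correction takes effect at the next control step rather than after the skill returns. A real system would presumably run monitoring concurrently with control; the single-threaded loop here only keeps the example deterministic.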

The team also introduced RoboVoLo, a high-fidelity benchmark for evaluating open-vocabulary long-horizon manipulation across four areas: common sense, memory/state tracking, complex references, and world knowledge. It reports both task-level success and failure-mode diagnostics.
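
As a rough picture of what reporting both task-level success and failure-mode diagnostics could look like, the following Python sketch aggregates per-episode records by category. The record layout, the `summarize` helper, and the failure-mode labels are assumptions made for illustration, and the episodes are made-up toy data rather than RoboVoLo results.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Episode:
    """Hypothetical per-episode record; not RoboVoLo's actual schema."""
    category: str
    success: bool
    failure_mode: Optional[str] = None  # illustrative label, set only on failure


def summarize(episodes: List[Episode]) -> Dict[str, dict]:
    """Compute a success rate and a failure-mode histogram for each category."""
    totals: Counter = Counter()
    wins: Counter = Counter()
    failures: Dict[str, Counter] = defaultdict(Counter)
    for ep in episodes:
        totals[ep.category] += 1
        if ep.success:
            wins[ep.category] += 1
        elif ep.failure_mode is not None:
            failures[ep.category][ep.failure_mode] += 1
    return {
        cat: {
            "success_rate": wins[cat] / totals[cat],
            "failure_modes": dict(failures[cat]),
        }
        for cat in totals
    }


# Toy episodes, invented for this example only.
episodes = [
    Episode("common sense", True),
    Episode("common sense", False, "wrong_object"),
    Episode("memory/state tracking", False, "lost_state"),
    Episode("memory/state tracking", False, "lost_state"),
    Episode("world knowledge", True),
]
summary = summarize(episodes)
for category, stats in summary.items():
    print(category, stats)
```

Keeping the failure-mode counts next to the success rate shows why a category scores poorly as well as how often it fails, which is the kind of diagnostic the benchmark description points to.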

Experiments show that VoLoAgent substantially outperforms standalone VLA or VLM baselines as well as tool-based systems, and real-robot experiments support its practical applicability.