Researchers have introduced Thinker, a 7-billion-parameter vision-language foundation model purpose-built for embodied intelligence, which achieves state-of-the-art results on key robot task planning benchmarks. The model targets two failure modes that even advanced VLMs show in robotics: confusing third-person and first-person perspectives, and overlooking information near the end of a video when reasoning.
Thinker employs a two-stage training strategy. Stage 1 establishes basic perception and reasoning capabilities using a mix of general datasets, spatial understanding data, and large-scale planning datasets. Stage 2 applies supervised fine-tuning for specific downstream task alignment. A critical innovation lies in its video understanding approach: by jointly incorporating key frames and full videos as inputs, the model substantially enhances temporal comprehension.
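To make the joint key-frame and full-video input concrete, here is a minimal sketch of how the two streams of frame indices might be assembled. The article does not say how frames are sampled or how key frames are chosen, so the function names, the uniform sampling scheme, and the default of 8 sampled frames are all assumptions for illustration, not Thinker's actual pipeline.

```python
from typing import Dict, Iterable, List


def uniform_indices(num_frames: int, num_samples: int) -> List[int]:
    """Evenly spaced frame indices that always include the final frame."""
    if num_frames <= 0 or num_samples <= 0:
        return []
    if num_samples == 1:
        return [num_frames - 1]
    if num_samples >= num_frames:
        return list(range(num_frames))
    step = (num_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]


def build_inputs(
    num_frames: int,
    key_frames: Iterable[int],
    num_samples: int = 8,  # assumed default, not taken from the article
) -> Dict[str, List[int]]:
    """Return two index streams: a sampled full-video view and the key frames.

    Hypothetical helper: key frame indices are supplied by the caller,
    because the article does not describe how they are selected.
    """
    video_stream = uniform_indices(num_frames, num_samples)
    # Drop out-of-range and duplicate key frames, keep temporal order.
    key_stream = sorted({i for i in key_frames if 0 <= i < num_frames})
    return {"video": video_stream, "key_frames": key_stream}


inputs = build_inputs(num_frames=10, key_frames=[9, 2, 2, 42], num_samples=4)
print(inputs)
```

Keeping the two streams separate, rather than merging them into one frame list, lets a model treat key frames as a distinct input. Always sampling the final frame is one simple way to avoid dropping information from the end of a clip.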
The research team constructed a 1.8-million-sample robot planning dataset, Robovideo-1.8M, alongside an industrial task planning dataset, Industroplan-200K. Additional training data includes more than 570K visual grounding samples for fine-grained spatial understanding and 100K ego-view reasoning samples.
On the RoboVQA benchmark, Thinker-7B achieved an average BLEU score of 63.5, surpassing GPT-4V and all existing robotic vision-language models. On the EgoPlan-Bench2 benchmark, it achieved 58.2% top-1 accuracy, outperforming all baselines, including Qwen2.5-VL-7B and Cosmos-Reason1-7B. The model demonstrated particular strength in long-horizon task planning and spatial reasoning.
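For readers unfamiliar with the two metrics, the sketch below shows how an average BLEU score and a top-1 accuracy can be computed. It assumes a single reference per answer, pre-tokenized text, no smoothing, and that "average BLEU" means the mean of BLEU-1 through BLEU-4; the article states none of these details, and the benchmarks' official scoring scripts may differ. Scores here are on a 0 to 1 scale, whereas the figures above are on a 0 to 100 scale.

```python
import math
from collections import Counter
from typing import List, Sequence


def ngrams(tokens: Sequence[str], n: int) -> Counter:
    """Count the n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate: Sequence[str], reference: Sequence[str], n: int) -> float:
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it occurs in the reference."""
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def bleu(candidate: Sequence[str], reference: Sequence[str], max_n: int) -> float:
    """Unsmoothed BLEU-max_n against one reference, with brevity penalty."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    if len(candidate) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(candidate))
    return brevity_penalty * math.exp(log_mean)


def average_bleu(candidate: Sequence[str], reference: Sequence[str]) -> float:
    """Assumed definition: mean of BLEU-1 through BLEU-4."""
    return sum(bleu(candidate, reference, n) for n in range(1, 5)) / 4


def top1_accuracy(predictions: List[str], answers: List[str]) -> float:
    """Fraction of questions where the single predicted choice is correct."""
    if not answers:
        return 0.0
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


reference = "put the cup on the table".split()
print(average_bleu(reference, reference))
print(top1_accuracy(["A", "B", "C"], ["A", "B", "D"]))
```

Because the BLEU sketch is unsmoothed, a short answer with no matching 4-gram scores zero for BLEU-4. Production toolkits usually add smoothing for that reason.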
The paper's key contributions include building the largest dedicated robot planning dataset (Robovideo-1.8M), developing a specialized 7B-parameter vision-language model for robot manipulation, achieving state-of-the-art results across multiple robot benchmarks, which demonstrates the value of specialized training, and committing to open-source the complete technical report, architecture, and weights.
