Open-Source Breakthrough in Embodied Reasoning
A joint research team from Zhejiang University's College of Computer Science, the Institute of Software at the Chinese Academy of Sciences (CAS), and Alibaba Group's DAMO Academy has released Embodied-Reasoner, a fully open-source multimodal embodied reasoning model that brings o1-style deep thinking to interactive physical tasks.
Outperforming Industry Giants
In comprehensive evaluations across 809 test cases in the AI2-THOR simulator, Embodied-Reasoner (7B) achieved:
- 80.96% task success rate, versus 71.73% for OpenAI o1, 56.55% for o3-mini, and 67.70% for Claude-3.7
- 55.07% search efficiency, the highest among all tested models
- 86.30% task completeness, outperforming all competitors
- 54.29% success rate on composite multi-step tasks, nearly 4x better than o3-mini
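To put the headline success rates in perspective, the short sketch below computes the percentage-point margin between Embodied-Reasoner (7B) and each baseline. It uses only the success rates quoted above; the variable names are illustrative and not part of the released code.

```python
# Task success rates (%) as reported in the evaluation above.
scores = {
    "Embodied-Reasoner (7B)": 80.96,
    "OpenAI o1": 71.73,
    "o3-mini": 56.55,
    "Claude-3.7": 67.70,
}

ours = scores["Embodied-Reasoner (7B)"]

# Percentage-point margin over each baseline (not a relative improvement).
margins = {
    name: round(ours - score, 2)
    for name, score in scores.items()
    if name != "Embodied-Reasoner (7B)"
}

for name, margin in margins.items():
    print(f"{name}: +{margin:.2f} points")
```

The margins are absolute differences in success rate, so they should not be read as relative speedups or multiples.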
Three-Stage Training Pipeline
The team attributes these results to a three-stage training pipeline:
- Imitation Learning: Fine-tuning on 9,300 synthesized Observation-Thought-Action trajectories (64K images, 8M thought tokens) covering 107 indoor scenes.
- Self-Exploration (Rejection Sampling): The model generates multiple trajectories on novel tasks and uses successful ones to enhance exploration abilities.
- Self-Correction (Reflection Tuning): By injecting anomalous states and reflective thinking, the model learns to detect and correct its own errors.
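The second and third stages can be illustrated with a minimal sketch. This is not the released training code: the `Trajectory` class, the `toy_rollout_factory` policy, its success rule, and the `inject_reflection` helper are all hypothetical stand-ins, chosen only to show the shape of rejection sampling (keep successful rollouts) and reflection tuning (splice an error and a corrective thought into a good trajectory).

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trajectory:
    task: str
    actions: List[str]
    success: bool


def rejection_sample(
    tasks: List[str],
    rollout: Callable[[str], Trajectory],
    samples_per_task: int = 4,
) -> List[Trajectory]:
    """Sample several rollouts per task and keep only the successful ones."""
    kept = []
    for task in tasks:
        for _ in range(samples_per_task):
            traj = rollout(task)
            if traj.success:
                kept.append(traj)
    return kept


def inject_reflection(traj: Trajectory, step: int, wrong_action: str) -> Trajectory:
    """Insert a wrong action plus a reflective thought before position `step`."""
    detour = [wrong_action, "think: that did not help, return to the plan"]
    actions = traj.actions[:step] + detour + traj.actions[step:]
    return Trajectory(traj.task, actions, traj.success)


def toy_rollout_factory(seed: int) -> Callable[[str], Trajectory]:
    """Hypothetical stand-in for the model: picks random actions."""
    rng = random.Random(seed)
    vocabulary = ["navigate", "open", "observe", "pickup"]

    def rollout(task: str) -> Trajectory:
        actions = [rng.choice(vocabulary) for _ in range(3)]
        # Toy success rule (an assumption): the episode must end with a pickup.
        return Trajectory(task, actions, actions[-1] == "pickup")

    return rollout


tasks = ["find the mug", "find the keys"]
kept = rejection_sample(tasks, toy_rollout_factory(0))

demo = Trajectory("find the mug", ["navigate", "open", "pickup"], True)
corrected = inject_reflection(demo, step=1, wrong_action="open wrong drawer")
print(corrected.actions)
```

In a real pipeline the rollout would come from the model acting in the simulator, and the retained or augmented trajectories would feed the next round of fine-tuning; the sketch only shows the filtering and splicing logic.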
Real-World Validation
Beyond simulation, the team validated Embodied-Reasoner in real-world object search tasks across kitchen, bathroom, and bedroom scenes. The model demonstrated consistent spatial reasoning and efficient search behavior, avoiding the repetitive searches and logical inconsistencies observed in OpenAI o3-mini.
Open Availability
Embodied-Reasoner is available in 2B and 7B parameter versions, with the complete training dataset and codebase released on GitHub and Hugging Face.
Paper: arXiv:2503.21696 | Code: https://github.com/zwq2018/embodied_reasoner




