ACE Robotics (大晓机器人), in collaboration with the Multimedia Laboratory at The Chinese University of Hong Kong (CUHK MMLab), today announced the open-source release of ACE-Ego, a novel 'one-brain-multiple-forms' embodied operation Vision-Language-Action (VLA) model.
ACE-Ego is the flagship implementation of ACE Robotics' 'Human-centric' ACE R&D paradigm for VLA pretraining. Unlike the industry-standard 'robot-centric' approach, which relies on expensive teleoperation data collection, ACE-Ego converts large-scale, low-cost first-person human videos into effective training signals for robot manipulation models.
The model achieves state-of-the-art performance on two major embodied intelligence benchmarks:
On RoboCasa GR1 TableTop, an internationally recognized humanoid manipulation benchmark, ACE-Ego achieves a 72.8% average success rate, surpassing NVIDIA GR00T (47.6%), PI π₀.₅, JD JoyAI-RA (63.2%), and XPeng DIAL (70.2%). On specific tasks such as plate stacking and pot transferring, ACE-Ego exceeds a 98% success rate.
On RoboTwin 2.0, a high-difficulty dual-arm manipulation benchmark, ACE-Ego achieves 91.12% in clean scenarios and 90.62% in heavily randomized scenarios. The drop from clean to randomized settings is only 0.5 percentage points, which indicates robust adaptation to environmental variation. These results surpass Tencent Hy-VLA (90.9%/90.1%), JD JoyAI-RA (90.48%/89.28%), Ant LingBot-VLA (88.56%/86.68%), and PI π₀.₅ (82.74%/76.76%).
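The clean-to-randomized gap quoted above can be checked directly from the reported pairs. The following is a minimal sketch that only restates the figures in this announcement; the helper name `robustness_drop` is chosen here for illustration and does not come from the ACE-Ego release.

```python
# (clean %, randomized %) success rates on RoboTwin 2.0, as quoted in this announcement.
scores = {
    "ACE-Ego": (91.12, 90.62),
    "Tencent Hy-VLA": (90.9, 90.1),
    "JD JoyAI-RA": (90.48, 89.28),
    "Ant LingBot-VLA": (88.56, 86.68),
    "PI π₀.₅": (82.74, 76.76),
}


def robustness_drop(clean, randomized):
    """Return the drop from clean to randomized scenarios, in percentage points."""
    return round(clean - randomized, 2)


# Illustrative helper output: one drop value per model.
drops = {name: robustness_drop(*pair) for name, pair in scores.items()}

# List models from the smallest drop (most robust) to the largest.
for name, drop in sorted(drops.items(), key=lambda item: item[1]):
    print(f"{name}: {drop:.2f} percentage points")
```

On these figures ACE-Ego has the smallest drop of the five listed models, and the gap is expressed in percentage points rather than as a relative percentage.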
ACE-Ego introduces four core mechanisms to bridge the gap between human video and robot data: unified camera-space action representation, unified morphology encoding, time-aligned dynamic chunking, and reliability-aware adaptive objective functions. These systematically resolve the quadruple heterogeneity challenge across spatial coordinate systems, embodiment structures, temporal frequencies, and label quality.
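To make the first of these mechanisms concrete, the sketch below shows one common way to express motion in a camera frame: world-frame wrist positions are mapped into the camera's coordinate system, and actions are taken as displacements between consecutive frames. This is an illustrative assumption, not ACE-Ego's published formulation; the function names, the camera pose, and the sample wrist track are all hypothetical.

```python
def world_to_camera(point_world, cam_rotation, cam_position):
    """Map a 3D point from the world frame into the camera frame.

    cam_rotation is a 3x3 matrix whose columns are the camera axes expressed
    in world coordinates, and cam_position is the camera origin in the world,
    so p_cam = R^T (p_world - t).
    """
    offset = [p - c for p, c in zip(point_world, cam_position)]
    # Multiplying by R^T: dot each column of R with the offset vector.
    return [sum(cam_rotation[i][j] * offset[i] for i in range(3)) for j in range(3)]


def camera_space_deltas(points_world, cam_rotation, cam_position):
    """Return frame-to-frame displacements of a track, expressed in the camera frame."""
    pts = [world_to_camera(p, cam_rotation, cam_position) for p in points_world]
    return [[b[k] - a[k] for k in range(3)] for a, b in zip(pts, pts[1:])]


# Hypothetical camera: rotated 90 degrees about the world z-axis, placed at x = 1.
ROT_Z_90 = [
    [0.0, -1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
]
CAM_POS = [1.0, 0.0, 0.0]

# Hypothetical two-frame wrist track in world coordinates.
wrist_track = [[1.0, 1.0, 0.0], [1.0, 2.0, 0.5]]

deltas = camera_space_deltas(wrist_track, ROT_Z_90, CAM_POS)
print(deltas)
```

Because the displacement is expressed relative to the camera, the same representation could in principle describe a human hand seen from a head-mounted camera and a robot gripper seen from its own camera, which is the kind of spatial mismatch the paragraph above refers to.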
Experimental results show that adding large-scale first-person human video to joint pretraining raises the model's success rate on RoboCasa from 68.3% to 72.8%, an absolute gain of 4.5 percentage points, which supports the value of human-centric large-scale data pretraining.
The model has demonstrated practical capabilities in complex retail operations, including plastic bag packing, shoe box packaging, and coffee dispensing — tasks requiring long-horizon, contact-rich manipulation far beyond simple tabletop grasping.
The technical report is available on arXiv (2606.17200), with the project page at https://acerobotics-vla.github.io/ACE-Ego/. The model weights and code are being released to the open-source community.

