[Figure: ACE-Ego VLA model overview, showing unified egocentric human video and multi-embodiment robot data for VLA pretraining]
Research | June 17, 2026 | Embodied Global Team

ACE Robotics and CUHK Open-Source ACE-Ego VLA Model, Achieving SOTA on Two Major Benchmarks

ACE Robotics and CUHK MMLab have open-sourced the ACE-Ego VLA model, which takes a human-centric approach to pretraining and achieves SOTA results on the RoboCasa GR1 (72.8%) and RoboTwin 2.0 (90.62%) benchmarks.

Tags: ACE Robotics (大晓机器人), CUHK, VLA, ACE-Ego, embodied AI, open source, robot manipulation

ACE Robotics (大晓机器人), in collaboration with the Multimedia Laboratory at The Chinese University of Hong Kong (CUHK MMLab), today announced the open-source release of ACE-Ego, a novel 'one-brain-multiple-forms' Vision-Language-Action (VLA) model for embodied manipulation.

ACE-Ego is the flagship implementation of ACE Robotics' 'Human-centric' ACE R&D paradigm for VLA pretraining. Unlike the prevailing 'robot-centric' approach, which relies on expensive teleoperated data collection, ACE-Ego converts large-scale, low-cost first-person human videos into effective training signals for robot manipulation models.

The model achieves state-of-the-art performance on two major embodied intelligence benchmarks:

On RoboCasa GR1 TableTop, an internationally recognized humanoid manipulation benchmark, ACE-Ego achieves a 72.8% average success rate, surpassing NVIDIA GR00T (47.6%), PI π₀.₅, JD JoyAI-RA (63.2%), and XPeng DIAL (70.2%). On specific tasks such as plate stacking and pot transferring, its success rate exceeds 98%.

On RoboTwin 2.0, a high-difficulty dual-arm manipulation benchmark, ACE-Ego achieves 91.12% in clean scenarios and 90.62% in heavily randomized scenarios. The drop from clean to randomized settings is only 0.5 percentage points, which indicates robust adaptability to environmental variation. These results (clean/randomized) surpass Tencent Hy-VLA (90.9%/90.1%), JD JoyAI-RA (90.48%/89.28%), Ant LingBot-VLA (88.56%/86.68%), and PI π₀.₅ (82.74%/76.76%).
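The robustness comparison above can be checked with a few lines of arithmetic. The following is a minimal sketch that uses only the clean/randomized figures quoted in this article; the dictionary layout and the helper name `degradation` are illustrative choices, not part of any released ACE-Ego code.

```python
# Clean / randomized success rates (%) on RoboTwin 2.0, as quoted in this article.
# The ASCII key "PI pi0.5" stands in for the model written as PI π₀.₅ in the text.
results = {
    "ACE-Ego": (91.12, 90.62),
    "Tencent Hy-VLA": (90.9, 90.1),
    "JD JoyAI-RA": (90.48, 89.28),
    "Ant LingBot-VLA": (88.56, 86.68),
    "PI pi0.5": (82.74, 76.76),
}


def degradation(clean, randomized):
    """Drop in success rate from clean to randomized, in percentage points."""
    return round(clean - randomized, 2)


# Hypothetical helper structure: model name -> drop in percentage points.
drops = {name: degradation(c, r) for name, (c, r) in results.items()}

# Print models from most to least robust (smallest drop first).
for name, drop in sorted(drops.items(), key=lambda kv: kv[1]):
    print(f"{name}: {drop:.2f} pp")
```

Under these figures, ACE-Ego shows the smallest clean-to-randomized drop of the five models listed, while the drop for PI π₀.₅ is close to six percentage points.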

ACE-Ego introduces four core mechanisms to bridge the gap between human video and robot data: a unified camera-space action representation, unified morphology encoding, time-aligned dynamic chunking, and reliability-aware adaptive objective functions. Together, these address four kinds of heterogeneity between the two data sources: spatial coordinate systems, embodiment structures, temporal frequencies, and label quality.
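To give a feel for what a shared camera-space representation can mean, the sketch below re-expresses a 3D position given in a robot base frame in the frame of the observing camera, using the standard rigid-body relation p_cam = Rᵀ(p_base − t). This is a generic textbook transform under assumed extrinsics, not ACE-Ego's actual formulation, which is described in the technical report; the function name `to_camera_frame` and all numeric values are hypothetical.

```python
def to_camera_frame(p_base, R_cam_in_base, t_cam_in_base):
    """Map a 3D point from a base frame into the camera frame.

    R_cam_in_base: 3x3 rotation whose columns are the camera axes expressed
                   in the base frame (assumed known from calibration).
    t_cam_in_base: camera origin expressed in the base frame.
    Computes p_cam = R^T (p_base - t).
    """
    d = [p_base[i] - t_cam_in_base[i] for i in range(3)]
    # Multiplying by R^T: component j is the dot product of column j of R with d.
    return [sum(R_cam_in_base[i][j] * d[i] for i in range(3)) for j in range(3)]


# Hypothetical extrinsics: camera sits at (1, 0, 0) in the base frame and is
# rotated 90 degrees about the base z-axis.
R = [[0, -1, 0],
     [1,  0, 0],
     [0,  0, 1]]
t = [1.0, 0.0, 0.0]

# A hypothetical end-effector (or hand) position in the base frame.
p_cam = to_camera_frame([1.0, 2.0, 0.5], R, t)
print(p_cam)
```

The appeal of such a representation is that a human hand seen in an egocentric video and a robot gripper seen by its own camera can both be described relative to the viewpoint that produced the image, without reference to a robot-specific base frame.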

Experimental results show that adding large-scale first-person human video to joint pretraining raises the model's success rate on RoboCasa from 68.3% to 72.8%, an absolute gain of 4.5 percentage points, which demonstrates the value of human-centric pretraining on large-scale data.

The model has demonstrated practical capabilities in complex retail operations, including plastic bag packing, shoe box packaging, and coffee dispensing — tasks requiring long-horizon, contact-rich manipulation far beyond simple tabletop grasping.

The technical report is available on arXiv (2606.17200), with the project page at https://acerobotics-vla.github.io/ACE-Ego/. The model weights and code are being released to the open-source community.
