Ant Group's Tianji Lab has unveiled AoE (Always-on Egocentric Human Video Collection for Embodied AI), a comprehensive system designed to solve the critical data bottleneck hindering embodied AI development. The system transforms everyday consumer smartphones into powerful embodied data collection devices.
The core challenge in embodied AI today is the scarcity of high-quality training data. Traditional approaches face fundamental limitations:
- Physical Robot Teleoperation: High precision but extremely expensive, with typical data collection facilities producing only 10,000 to 100,000 hours annually
- Handheld Grippers (UMI-style): Lower hardware costs but still require dedicated operators, limiting scalability
- First-Person Video: Extremely low cost but lacks action trajectories (3D hand poses, camera poses)
AoE builds on the third approach, first-person video, and industrializes the entire post-processing pipeline so that raw video becomes usable training data with the action trajectories it would otherwise lack.
Design Philosophy: Light Frontend, Heavy Backend
The frontend is deliberately kept simple and affordable:
- Device: User's own smartphone plus a $2.80 neck-mounted bracket
- Personnel: Real workers in their actual job positions (car washers, mechanics, chefs, cashiers)
- Collection: App automatically detects hand-object interactions, no manual start or stop required
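The automatic start and stop behavior can be illustrated with a small sketch. It assumes the app already produces a per-frame hand-object interaction score between 0 and 1; the function name, the thresholds, and the minimum segment length are hypothetical and are not taken from the AoE system. Two thresholds (hysteresis) are used here so that a briefly noisy score does not cut a recording in half.

```python
from typing import List, Tuple


def segment_interactions(
    scores: List[float],
    start_thresh: float = 0.6,  # hypothetical: score needed to start recording
    stop_thresh: float = 0.4,   # hypothetical: score below which recording stops
    min_len: int = 3,           # hypothetical: discard segments shorter than this
) -> List[Tuple[int, int]]:
    """Return half-open [start, end) frame ranges where an interaction is active."""
    segments: List[Tuple[int, int]] = []
    start = None  # index of the first frame of the currently open segment
    for i, score in enumerate(scores):
        if start is None and score >= start_thresh:
            start = i  # open a new segment
        elif start is not None and score < stop_thresh:
            if i - start >= min_len:
                segments.append((start, i))  # close and keep the segment
            start = None  # close (and possibly drop) the segment
    # Close a segment that is still open when the video ends.
    if start is not None and len(scores) - start >= min_len:
        segments.append((start, len(scores)))
    return segments


scores = [0.1, 0.7, 0.5, 0.8, 0.3, 0.2, 0.9, 0.1, 0.65, 0.7, 0.9]
print(segment_interactions(scores))
```

In this example the dip to 0.5 at frame 2 does not end the first segment because it stays above the stop threshold, while the single-frame spike at frame 6 is dropped as too short.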
The backend uses sophisticated algorithms to fill in missing trajectories:
- 3D hand reconstruction from monocular video (MANO parameters)
- 6DoF camera trajectory estimation (SLAM plus depth priors)
- Action semantic annotation via multimodal LLMs
- Triple quality inspection (edge-side plus cloud plus human sampling)
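One way to see how the first two backend outputs fit together: a hand position reconstructed in the camera frame becomes a world-frame trajectory point once it is combined with the estimated camera pose for that frame. The sketch below shows only this standard rigid transform, p_world = R * p_cam + t. It assumes the camera pose is given as a camera-to-world rotation matrix and translation; the function names and data layout are illustrative, not AoE's actual interfaces.

```python
from typing import List, Sequence, Tuple

Vec3 = Sequence[float]
Mat3 = Sequence[Sequence[float]]


def camera_to_world(p_cam: Vec3, r_wc: Mat3, t_wc: Vec3) -> List[float]:
    """Apply a camera-to-world rigid transform: p_world = r_wc @ p_cam + t_wc."""
    return [
        sum(r_wc[row][col] * p_cam[col] for col in range(3)) + t_wc[row]
        for row in range(3)
    ]


def hand_trajectory_world(
    hand_cam: Sequence[Vec3],
    cam_poses: Sequence[Tuple[Mat3, Vec3]],
) -> List[List[float]]:
    """Map per-frame camera-frame hand positions to world-frame positions."""
    return [camera_to_world(p, r, t) for p, (r, t) in zip(hand_cam, cam_poses)]


# Camera rotated 90 degrees about the z axis and located at (1, 2, 3).
r = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
t = [1.0, 2.0, 3.0]
print(hand_trajectory_world([[1.0, 0.0, 0.0]], [(r, t)]))
```

The same transform would be applied to every hand joint in every frame; a single wrist point is used here to keep the example short.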
Key Technical Innovations: Data Map plus Task Distribution system, Automated Quality Inspection Flywheel, and Heterogeneous Device Adaptation.
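The edge, cloud, and human-sampling stages of the quality inspection could be chained roughly as follows. This is a minimal sketch under stated assumptions: the `Clip` fields, the score threshold, the 5% sampling rate, and the status strings are all hypothetical placeholders, and the source does not describe how AoE actually implements each stage.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Clip:
    clip_id: str
    edge_ok: bool       # hypothetical: result of an on-device check
    cloud_score: float  # hypothetical: cloud-side quality score in [0, 1]


def needs_human_review(clip_id: str, sample_rate: float) -> bool:
    """Deterministically pick a fraction of clips for human spot checks."""
    digest = hashlib.sha256(clip_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < sample_rate


def route(clip: Clip, cloud_thresh: float = 0.8, sample_rate: float = 0.05) -> str:
    """Run the three gates in order, from cheapest to most expensive."""
    if not clip.edge_ok:
        return "rejected_edge"
    if clip.cloud_score < cloud_thresh:
        return "rejected_cloud"
    if needs_human_review(clip.clip_id, sample_rate):
        return "human_review"
    return "accepted"


print(route(Clip("clip-001", edge_ok=True, cloud_score=0.93)))
```

Hashing the clip ID rather than drawing a random number keeps the sampling decision reproducible, so the same clip is always routed the same way on reprocessing.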
Results and Impact: Combining approximately 10 robot-free demonstrations with 1 real-robot demonstration delivers performance comparable to that of models trained entirely on physical robot data, reducing the amount of real-robot data needed by up to 20 times. The system offers over 2,000 hours of validated multimodal demonstrations with strong zero-shot transfer capabilities across different robot platforms.

