A research team from Tsinghua University's Department of Automation, in collaboration with Shouyi Technology, has released EgoEMG, the industry's first multimodal egocentric dataset to provide surface electromyography (EMG), vision, depth, and motion data for hand pose estimation simultaneously, with all streams time-synchronized under a unified protocol.
The dataset and accompanying paper (arXiv:2605.05712) address a critical gap in embodied intelligence: the lack of a unified benchmark to compare vision-based and EMG-based approaches to hand perception. This 'last centimeter' challenge is fundamental to enabling robots to perform dexterous manipulation tasks.
EgoEMG covers 41 participants and more than 10 hours of synchronized multimodal data. Its learning-based markers2mano pipeline reduces the invalid-frame rate from 12.7% (Meta's EMG2Pose baseline) to 3.6%, with a MANO-to-marker alignment error of just 4.3 mm.
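To make the two quality figures above concrete, here is a minimal sketch of how such metrics could be computed. It assumes that alignment error means the mean Euclidean distance between model-predicted and measured marker positions, and that a frame counts as invalid when that error exceeds a threshold. Both definitions, the function names, and the 10 mm default are illustrative assumptions, not the paper's stated criteria.

```python
import numpy as np

def marker_alignment_error_mm(model_markers, measured_markers):
    """Per-frame mean Euclidean distance between model and measured markers.

    Both inputs are arrays of shape (frames, markers, 3), in millimetres.
    ASSUMPTION: this is one plausible definition of alignment error.
    """
    dists = np.linalg.norm(model_markers - measured_markers, axis=-1)
    return dists.mean(axis=-1)  # shape: (frames,)

def invalid_frame_rate(per_frame_error_mm, threshold_mm=10.0):
    """Fraction of frames whose alignment error exceeds the threshold.

    ASSUMPTION: the threshold rule and its 10 mm default are hypothetical.
    """
    return float(np.mean(per_frame_error_mm > threshold_mm))
```

Under these assumptions, a lower invalid-frame rate means more of the recorded frames end up with a usable pose label.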
Key findings from the research:
- The error of EMG-only estimation is 2.4 times that of vision-only estimation, a structural limitation rooted in the information density of the EMG signal
- Cross-user generalization remains a fundamental challenge for pure EMG approaches
- Vision-dominant multimodal fusion achieves the best results, with EMG serving as a complementary modality for occluded scenarios
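Cross-user generalization is typically measured by holding out entire participants, so that no test user contributes any training data. The sketch below shows one way to build such a split. The function name, the 20% hold-out fraction, and the seed are hypothetical; the actual EgoEMG split protocol is not described here.

```python
import random

def split_by_participant(participant_ids, test_fraction=0.2, seed=0):
    """Split sample indices so that test participants never appear in training.

    participant_ids: one participant ID per sample.
    Returns (train_indices, test_indices, held_out_participants).
    """
    unique = sorted(set(participant_ids))
    rng = random.Random(seed)
    rng.shuffle(unique)
    # Hold out at least one participant.
    n_test = max(1, round(len(unique) * test_fraction))
    held_out = set(unique[:n_test])
    train_idx = [i for i, p in enumerate(participant_ids) if p not in held_out]
    test_idx = [i for i, p in enumerate(participant_ids) if p in held_out]
    return train_idx, test_idx, held_out
```

Splitting by sample instead of by participant would let a model see each test user's EMG signals during training, which would overstate how well it generalizes to new users.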
The team also designed EMGFormer, a novel architecture for EMG-to-pose estimation that achieves a 22% improvement over the previous state of the art on the hardest generalization subset. For EMG+vision integration, the team introduced a residual fusion architecture in which the EMG branch learns only what vision cannot see, such as finger bending during occlusion.
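The residual idea can be illustrated with a toy example: the vision branch supplies the base pose, and the EMG branch predicts only a correction that is added to it. The sketch below uses a linear least-squares head as a stand-in for the EMG branch. It is not EMGFormer or the paper's fusion network, and the class name, method names, and feature shapes are all hypothetical.

```python
import numpy as np

class ResidualFusion:
    """Toy residual fusion: fused_pose = vision_pose + f(emg_features).

    ASSUMPTION: f is a linear map, used purely for illustration.
    """

    def __init__(self, emg_dim, pose_dim):
        # Zero initialization: before fitting, the fused output
        # is exactly the vision prediction.
        self.W = np.zeros((emg_dim, pose_dim))
        self.b = np.zeros(pose_dim)

    def __call__(self, vision_pose, emg_features):
        residual = emg_features @ self.W + self.b
        return vision_pose + residual

    def fit_residual(self, vision_pose, emg_features, target_pose):
        # The EMG head is fit to the vision branch's remaining error,
        # not to the full pose.
        X = np.hstack([emg_features, np.ones((len(emg_features), 1))])
        sol, *_ = np.linalg.lstsq(X, target_pose - vision_pose, rcond=None)
        self.W, self.b = sol[:-1], sol[-1]
```

Because the head is trained on the gap between the target pose and the vision prediction, it has nothing to learn on frames where vision is already accurate and contributes mainly where vision fails, for example under occlusion.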
EgoEMG establishes three benchmark tasks — EMG→pose, vision→pose, and EMG+vision fusion — providing a standardized evaluation protocol that positions vision-dominant multimodal fusion as the most promising path forward for precise hand perception in embodied AI.


