Research | June 14, 2026 | Embodied Global

Baidu Baige's LoongForge Halves Training Cycle for NVIDIA GR00T N1.6, Delivering 2.3x Throughput

Baidu Baige's LoongForge delivers end-to-end system optimization for NVIDIA GR00T N1.6 VLA model training, achieving 2.3x throughput and cutting training cycle by 56.6 percent through intelligent IO, communication optimization, and operator scheduling.

#NVIDIA #Baidu #LoongForge #GR00T #VLA #training optimization

Baidu Baige's LoongForge framework has optimized training of NVIDIA's GR00T N1.6 Vision-Language-Action (VLA) model, delivering up to 2.3x training throughput and shortening the overall training cycle by 56.6 percent.

As humanoid robots move toward industrialization, VLA models have become a core technical pathway for embodied intelligence, connecting perception, understanding, and action end to end. NVIDIA's open-source GR00T N series is a representative technology stack for humanoid robots, widely used for robot intelligence training and R&D deployment.

Released in 2025, GR00T N1.6 revamps both the model architecture and action-generation paradigm. It uses Cosmos-Reason-2B as its multimodal vision-language perception core and introduces a 32-layer DiT backbone for action generation, jointly modeling first-person robot video, proprioceptive state, and natural-language instructions as a shared policy representation.

However, training such a large-scale VLA model presents significant challenges. IO stalls, communication overhead, and inefficient operator scheduling have been major bottlenecks. LoongForge, Baidu Baige's end-to-end system-level optimization framework, addresses these issues through:

  1. Intelligent IO Pipeline: Eliminates storage IO bottlenecks by prefetching and caching training data, reducing idle GPU time
  2. Communication Optimization: Implements gradient compression and optimized all-reduce strategies to minimize communication overhead across distributed nodes
  3. Operator Scheduling: Automatically fuses and reorders computation operators for maximum hardware utilization on NVIDIA GPUs
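
The first item, prefetching, can be illustrated with a minimal Python sketch. This is not LoongForge's implementation, which the source does not describe in detail. It only shows the general idea of overlapping data loading with compute by filling a bounded queue from a background thread. The names `prefetch`, `slow_loader`, and `train_step` are hypothetical, and the sleeps stand in for storage latency and GPU work.

```python
import queue
import threading
import time

_SENTINEL = object()  # marks the end of the data stream


def prefetch(batches, depth=2):
    """Yield items from `batches`, loading up to `depth` items ahead
    in a background thread so the consumer rarely waits on IO."""
    buf = queue.Queue(maxsize=depth)

    def producer():
        try:
            for batch in batches:
                buf.put(batch)  # blocks when the buffer is full
        finally:
            buf.put(_SENTINEL)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buf.get()
        if item is _SENTINEL:
            return
        yield item


def slow_loader(n, io_delay=0.01):
    """Hypothetical stand-in for reading n batches from storage."""
    for i in range(n):
        time.sleep(io_delay)  # simulated storage latency
        yield i


def train_step(batch, compute_delay=0.01):
    """Hypothetical stand-in for one training step."""
    time.sleep(compute_delay)  # simulated accelerator compute
    return batch * 2


def run(loader):
    """Consume a loader and return (outputs, elapsed seconds)."""
    start = time.perf_counter()
    outputs = [train_step(b) for b in loader]
    return outputs, time.perf_counter() - start


if __name__ == "__main__":
    _, serial = run(slow_loader(20))
    _, overlapped = run(prefetch(slow_loader(20)))
    print(f"serial: {serial:.3f}s, prefetched: {overlapped:.3f}s")
```

Without prefetching, each step pays the load time plus the compute time. With it, the loader works ahead while the previous step computes, so the per-step cost approaches the larger of the two. The actual gain depends on the ratio of IO to compute and on the buffer depth.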

The reported results are a 2.3x improvement in training throughput and a 56.6 percent reduction in total training cycle time, enabling faster iteration on embodied AI models at scale.
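
The two headline figures are roughly two views of the same gain. As a quick check, assuming the training cycle covers a fixed amount of work and is bound by throughput (the source does not state how either figure was measured), a speedup of s shortens the cycle by 1 - 1/s. The function names below are illustrative.

```python
def cycle_reduction(speedup):
    """Fraction of wall-clock time saved for a fixed workload
    when throughput improves by a factor of `speedup`."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


def implied_speedup(reduction):
    """Throughput factor implied by a fractional time reduction."""
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


print(f"{cycle_reduction(2.3):.1%}")     # 56.5%
print(f"{implied_speedup(0.566):.2f}x")  # 2.30x
```

Under that assumption, 2.3x throughput implies a reduction of about 56.5 percent, and a 56.6 percent reduction implies about 2.30x. The small difference is within the rounding of the published figures.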

This optimization is particularly significant as the embodied AI field transitions from research to production. Faster training cycles mean researchers can iterate on models more rapidly, accelerating the path toward general-purpose robot intelligence.

Source: Baidu Baige / NVIDIA Developer