Research | June 18, 2026 | Embodied Global Team

MINT-4B VLA Model Ranks Top 3 Globally, Powers Xiaozhi S2 Humanoid Robot

Guangdong Zhidong Future and Professor Cai Panpan have developed MINT-4B, a multimodal VLA model that ranks in the global top 3, ahead of OpenVLA and GR00T. Its SDAT multi-scale frequency-domain tokenization targets cross-environment generalization, and the model is deployed in Xiaozhi S2 humanoid robots.

#MINT-4B #VLA #humanoid robot #Xiaozhi S2 #SDAT #China embodied AI

MINT-4B: A New Contender in Embodied AI

Guangdong Zhidong Future, in collaboration with Professor Cai Panpan's team at the Shanghai Innovation Institute, has unveiled MINT-4B, a multimodal Vision-Language-Action (VLA) foundation model. It achieved a top-3 global ranking in NVIDIA's comprehensive evaluation of mainstream general-purpose robot large models, outperforming OpenVLA, GR00T, and UniVLA.

Core Innovation: SDAT Tokenization

The design philosophy is "Mimic Intent, Not Just Trajectories." SDAT (Scale-Decoupled Action Tokenization) uses multi-scale frequency-domain tokenization to decompose actions into low-frequency tokens, which carry global task intent, and high-frequency tokens, which carry fine motion details. Cross-scale autoregressive hierarchical decoding then generates actions across these scales. According to the team, this addresses the VLA challenge of generalizing across environments without retraining.
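The split into low- and high-frequency action tokens can be illustrated with a small sketch. This is not the SDAT implementation, whose details are not given here; it is a minimal example that assumes a plain FFT along the time axis, a hard frequency cutoff, and uniform binning. The function names (`split_scales`, `quantize`, `dequantize`), the cutoff, the bin count, and the value ranges are all hypothetical choices made for illustration.

```python
import numpy as np

def split_scales(actions, cutoff):
    """Split a (T, D) action chunk into low- and high-frequency parts along time.

    Assumption: a hard cutoff on FFT bins stands in for whatever
    multi-scale decomposition the real tokenizer uses.
    """
    n_steps = actions.shape[0]
    spectrum = np.fft.rfft(actions, axis=0)      # (T//2 + 1, D), complex
    low_spec = spectrum.copy()
    low_spec[cutoff:] = 0                        # keep only the first `cutoff` bins
    high_spec = spectrum - low_spec              # the remaining bins
    low = np.fft.irfft(low_spec, n=n_steps, axis=0)
    high = np.fft.irfft(high_spec, n=n_steps, axis=0)
    return low, high

def quantize(x, n_bins, lo, hi):
    """Map values in [lo, hi] to integer tokens in [0, n_bins - 1]."""
    clipped = np.clip(x, lo, hi)
    return np.round((clipped - lo) / (hi - lo) * (n_bins - 1)).astype(np.int64)

def dequantize(tokens, n_bins, lo, hi):
    """Inverse of quantize, up to half a bin width of error."""
    return tokens / (n_bins - 1) * (hi - lo) + lo

# Toy 7-DoF action chunk: one slow motion plus a small fast correction.
t = np.arange(32) / 32
slow = 0.5 * np.sin(2 * np.pi * t)               # 1 cycle per chunk
jitter = 0.02 * np.sin(2 * np.pi * 12 * t)       # 12 cycles per chunk
actions = (slow + jitter)[:, None] * np.ones((1, 7))

low, high = split_scales(actions, cutoff=4)
low_tokens = quantize(low, n_bins=256, lo=-1.0, hi=1.0)      # coarse "intent" scale
high_tokens = quantize(high, n_bins=256, lo=-0.05, hi=0.05)  # fine "detail" scale

# The two scales sum back to the original chunk before quantization.
print(np.allclose(low + high, actions))
```

In a hierarchical decoder built on this kind of split, the low-frequency tokens would be predicted first and the high-frequency tokens conditioned on them, so the coarse motion is fixed before the fine corrections are filled in.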

Commercial Deployment: Xiaozhi S2

The Xiaozhi S2 humanoid robot uses MINT-4B as its VLA "cerebellum." It is deployed in education, commercial exhibitions, government services, hotels, and shopping malls, where it handles reception, navigation, and patrol tasks and adapts well across environments.

Future

The team will continue iterating on its VLA algorithms to accelerate large-scale commercial deployment of humanoid robots.

Source: User submission