MINT-4B: A New Contender in Embodied AI
Guangdong Zhidong Future, in collaboration with Professor Cai Panpan's team at the Shanghai Innovation Institute, has unveiled MINT-4B, a multimodal Vision-Language-Action (VLA) foundation model. The model placed in the global top three in NVIDIA's comprehensive evaluation of mainstream general-purpose large robot models, outperforming OpenVLA, GR00T, and UniVLA.
Core Innovation: SDAT Tokenization
The design philosophy is "Mimic Intent, Not Just Trajectories." SDAT (Scale-Decoupled Action Tokenization) applies multi-scale, frequency-domain tokenization to decompose actions into low-frequency tokens, which carry the global task intent, and high-frequency tokens, which carry fine motion details. A cross-scale, autoregressive, hierarchical decoding scheme is designed to address the generalization challenge that VLA models face, so that the model can adapt to new environments without retraining.
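The low-frequency/high-frequency split described above can be illustrated with a small sketch. This is not MINT-4B's tokenizer: the article does not name the transform, band boundary, or quantizer, so the discrete cosine transform (DCT), the `n_low` cutoff, the `step` size, and the function names `tokenize` and `detokenize` are all assumptions chosen for illustration. The sketch treats one action dimension as a short trajectory, transforms it to the frequency domain, and quantizes the coefficients into two groups of integer tokens.

```python
import math

def dct(x):
    """Orthonormal DCT-II of a 1-D sequence (assumed transform, not from the article)."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def idct(c):
    """Inverse of dct() above (orthonormal DCT-III)."""
    n = len(c)
    out = []
    for t in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        s += sum(
            c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (t + 0.5) * k / n)
            for k in range(1, n)
        )
        out.append(s)
    return out

def tokenize(traj, n_low=4, step=0.05):
    """Split a trajectory into low- and high-frequency integer tokens.

    n_low and step are hypothetical hyperparameters for this sketch.
    """
    tokens = [round(c / step) for c in dct(traj)]  # uniform quantization
    return tokens[:n_low], tokens[n_low:]

def detokenize(low, high, step=0.05):
    """Rebuild a trajectory from the two token groups."""
    coeffs = [q * step for q in low] + [q * step for q in high]
    return idct(coeffs)

def l2(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

# A toy 16-step trajectory: a slow sweep plus a small fast wiggle.
traj = [
    math.sin(2 * math.pi * t / 16) + 0.05 * math.sin(2 * math.pi * 5 * t / 16)
    for t in range(16)
]
low, high = tokenize(traj)

# Low-frequency tokens alone give a coarse version of the motion;
# adding the high-frequency tokens restores the fine detail.
coarse = detokenize(low, [0] * len(high))
full = detokenize(low, high)
print("coarse error:", l2(traj, coarse))
print("full error:  ", l2(traj, full))
```

Under these assumptions, a hierarchical decoder could predict the `low` tokens first and the `high` tokens afterward, so that the coarse motion is fixed before details are filled in. Whether MINT-4B orders its decoding this way is not stated in the article.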
Commercial Deployment: Xiaozhi S2
The Xiaozhi S2 humanoid robot uses MINT-4B as its VLA "cerebellum." It has been deployed in education, commercial exhibitions, government services, hotels, and shopping malls, where it handles reception, navigation, and patrol tasks and is reported to adapt well across environments.
Future
The team will continue iterating on its VLA algorithms to accelerate the large-scale commercial deployment of humanoid robots.

