Zhiyuan Conference 2026: The Consensus and Divergence in Embodied Foundation Models
Research | June 18, 2026 | Stax


#Zhiyuan Conference #BAAI #Embodied AI #VLA #World Model #China Robotics #Foundation Model #Industry Analysis #2026


The 8th Beijing Zhiyuan Conference (June 12-13, 2026) brought together over 200 researchers and 40+ AI CEOs from China's most prominent embodied intelligence companies. The result? A rare window into where the industry agrees — and where it fundamentally disagrees.

The One Consensus: Data is Everything

In the "Embodied Industry CEO" forum, five company leaders — Han Fengtao (Qianxun AI), Zhou Yong (Lingxin Hand), Liu Dong (Xingyuan Intelligence), Xu Huazhe (Poké Robot), and Zhu Xing (Ant Lingbo) — reached exactly one unanimous agreement across all topics discussed: data is the critical bottleneck and competitive moat in embodied AI today.

As Han Fengtao phrased it, data determines the capability boundary of embodied models. The quality, diversity, and scale of training data will decide which companies emerge as leaders.

However, even on data, disagreements emerged once the discussion turned to specifics. Lu Zongqing (Peking University / BeingBeyond) noted that the industry has no consensus on training paradigms: unlike LLMs, where the pretrain-finetune pipeline is established, embodied AI has no equivalent proven formula. Open questions include what kind of data to use for pretraining, what ratio of real-world to simulation data works best, and whether to train scene-specific or general-purpose models.

Luo Jianlan (Shanghai Innovation Institute / Zhiyuan Robotics) argued real-world data must serve as the foundation, while Ding Wenchao (Taishi Zhihang / Fudan University) pointed out that data efficiency — not just data volume — is the most undervalued metric in the industry. Many suppliers deliver poorly labeled, low-quality data that requires extensive rework.

The Great Debate: VLA vs World Models

This was the most heated technical debate at the conference, with three distinct positions:

Position 1: VLA and World Models Are Fundamentally Different

Huang Tiejun (BAAI Chair) drew a sharp line: VLA is a "patchwork of three separate models" (Vision + Language + Action), while world models are a unified system where perception, cognition, and action all live within the same model architecture. He believes world models represent the correct path to general-purpose embodied intelligence, though they remain in early stages.

BAAI backed this position with concrete releases: Physis-v0.1, described as the world's first general world foundation model, using a physical latent space representation rather than pixel-level prediction. They also released RoboBrain Orca, a unified latent space model compressing text, images, video, and action into a single semantic space.

Position 2: World Models Are a Component of VLA

Guo Yandong (ZhiPingFang CEO) took the opposing view: VLA and world models are not competing paradigms — world models are spatial perception components within VLA architecture. He redefined VLA as "a broad end-to-end model architecture driven by multi-modal data fusion," encompassing world model capabilities.

ZhiPingFang demonstrated this fusion with Video2Act (developed jointly with Peking University), which it reports achieves a 30%+ performance improvement over comparable Silicon Valley models. It also launched NeuroVLA, a three-tier "cortex-cerebellum-spinal cord" architecture inspired by biological neural systems, which it says reduces motion jitter by 75%+ and enables reflex responses within 20 milliseconds.
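One way to picture a three-tier design like this is as control loops running at different rates, with the fastest tier owning reflexes. The sketch below is purely illustrative and is not NeuroVLA's implementation: the tier names mirror the article, the 20 ms bound comes from the article, and every period value and function name is a hypothetical assumption.

```python
from dataclasses import dataclass

# Hypothetical tick periods in milliseconds. Only the 20 ms reflex bound
# comes from the article; the three periods are illustrative assumptions.
CORTEX_PERIOD_MS = 500      # slow tier: task-level planning
CEREBELLUM_PERIOD_MS = 50   # middle tier: motion smoothing
SPINAL_PERIOD_MS = 10       # fast tier: reflexes
REFLEX_BUDGET_MS = 20       # reflex bound quoted in the article


@dataclass
class TickLog:
    cortex: int = 0
    cerebellum: int = 0
    spinal: int = 0


def run_tiers(duration_ms: int) -> TickLog:
    """Count how often each tier runs in a fixed-step simulation."""
    log = TickLog()
    for t in range(0, duration_ms, SPINAL_PERIOD_MS):
        if t % CORTEX_PERIOD_MS == 0:
            log.cortex += 1
        if t % CEREBELLUM_PERIOD_MS == 0:
            log.cerebellum += 1
        log.spinal += 1  # the fast tier runs on every step
    return log


def worst_case_reflex_wait_ms() -> int:
    """A disturbance arriving just after a spinal tick waits at most one
    period before the next tick sees it (processing time is ignored)."""
    return SPINAL_PERIOD_MS


if __name__ == "__main__":
    print(run_tiers(1000))
    print(worst_case_reflex_wait_ms() <= REFLEX_BUDGET_MS)
```

Under these assumed periods, the slow tier runs only a couple of times per second while the fast tier stays inside the reflex budget, which is the general motivation for separating the tiers.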

Position 3: Action Chain of Thought

Zhiyuan Robotics took yet another approach with its GO-2 embodied foundation model, which introduces "Action Chain of Thought" technology: the robot performs internal reasoning and planning before execution, bridging the gap between semantic understanding and precise motion control. The company also open-sourced Genie Envisioner (GE), a vision-centric platform integrating prediction, policy learning, and simulation.

The Commercialization Divide

Build Fundamentals First vs Deploy Now

Han Fengtao (Qianxun AI) argued forcefully against premature commercialization. He compared current embodied models to "kindergarten-level intelligence" — deploying them at scale requires 1-2 months of per-project debugging, making unit economics impossible. His view: invest in foundation model training for the next two years until models reach "high school or college level," then scale.
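The unit-economics concern can be made concrete with back-of-the-envelope arithmetic. The sketch below assumes, hypothetically, a single deployment team that handles projects one at a time and spends all of its time on debugging; the 1-2 month range is from the article, and the function name is illustrative.

```python
def projects_per_year(debug_months_per_project: float,
                      months_per_year: int = 12) -> float:
    """Upper bound on projects one team can deliver sequentially per year,
    assuming debugging is the only time cost (a simplifying assumption)."""
    if debug_months_per_project <= 0:
        raise ValueError("debug time must be positive")
    return months_per_year / debug_months_per_project


if __name__ == "__main__":
    for months in (1, 2):
        print(months, "month(s) per project:", projects_per_year(months))
```

At 1-2 months of debugging per project, such a team delivers between 6 and 12 deployments a year, so throughput grows only by adding teams rather than by shipping more units.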

Liu Dong (Xingyuan Intelligence) countered with a compelling analogy to autonomous driving. Companies that targeted L4 autonomy from day one moved slowly; those that deployed L2 systems first captured market feedback faster. He believes embodied AI must be iterated in real-world environments — "falling in deployment, fixing with feedback" — or risk developing in the wrong technical direction entirely.

To B or To C?

Most Chinese embodied AI companies target industrial/B2B scenarios. Poké Robot (Xu Huazhe) is a notable outlier: barely three months old (founded March 3, 2026), it is going directly to consumers. Xu's logic is that factory environments are too structured and repetitive to generate the diverse data needed for general intelligence, whereas the unstructured, long-tail nature of home scenarios makes them the richest training ground.

Poké Robot has chosen a world model architecture from day one, rejecting the VLA approach. Their first hardware product targets late August/early September 2026, with a goal of entering first homes by March 2028. Their litmus test? Making a squirrel-shaped mandarin fish dish — the ultimate benchmark for fine manipulation.

The Benchmark Problem

On whether the industry needs a neutral third-party evaluation system, Xu Huazhe was the lone dissenter. His argument: any benchmark invites "teaching to the test," and physical robots are too sensitive to environmental variables (a single loose screw can crash task success rates). "Whether products sell and work in real environments is the only benchmark that matters."

Wang He (Galaxy General) proposed an alternative definition of embodied AI's "ChatGPT moment": a robot capable of completing any common human skill at 70-80% success rate, with high deployability. He predicts breakthroughs in 2-3 years, with meaningful shipment growth around late 2028 — beginning in B2B, not consumer.
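Wang He's criterion can be read as a simple check over per-skill success rates. The sketch below is one possible interpretation, not his formal definition: it takes the lower end of the 70-80% range as the bar, requires every listed skill to clear it, and leaves "high deployability" unmodeled. The skill names and rates are made-up placeholders, not reported results.

```python
def failing_skills(success_rates: dict[str, float],
                   threshold: float = 0.70) -> list[str]:
    """Return the skills whose success rate falls below the threshold."""
    return sorted(s for s, r in success_rates.items() if r < threshold)


def reaches_moment(success_rates: dict[str, float],
                   threshold: float = 0.70) -> bool:
    """True only if at least one skill is listed and all clear the bar."""
    return bool(success_rates) and not failing_skills(success_rates, threshold)


if __name__ == "__main__":
    # Hypothetical placeholder data for illustration only.
    rates = {"fold laundry": 0.82, "load dishwasher": 0.74, "pour water": 0.65}
    print(reaches_moment(rates))
    print(failing_skills(rates))
```

Requiring every skill to pass, rather than averaging, reflects the "any common human skill" wording: a high average can hide skills that fail most of the time.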

What This Means

The Zhiyuan Conference revealed an industry in productive tension. There is clarity that data is the essential moat and that hardware is sufficient for many use cases. On virtually everything else (VLA vs world models, deployment timing, evaluation standards, market segments), China's embodied AI leaders openly disagree.

This is what a field looks like before a paradigm crystallizes. The company that picks the right path through these disagreements will define the next era of embodied intelligence.


Embodied Global tracks 1,000+ articles on 53 companies across the embodied AI ecosystem. embodiedglobal.com

Sources: Beijing Business Today, InfoQ, The Paper, CGTN, Zhiyuan Conference official materials
