Research | June 21, 2026 | Embodied Global Team

Holi-Spatial: ICML 2026 Oral — Fully Automated 3D Spatial Intelligence Data Pipeline with 4M-Scale Dataset

Shanghai AI Lab, NWPU, and SJTU researchers introduce Holi-Spatial, an ICML 2026 Oral paper that transforms raw video into structured 3D spatial intelligence data automatically — building the 4M-scale Holi-Spatial-4M dataset for spatial AI training.


A research team from Shanghai Artificial Intelligence Laboratory, Northwestern Polytechnical University (NWPU), and Shanghai Jiao Tong University (SJTU) has introduced Holi-Spatial, a fully automated framework for constructing 3D spatial intelligence training data from raw video streams. The paper has been accepted as an Oral presentation at ICML 2026.

The Data Bottleneck in Spatial Intelligence

Multimodal large language models have advanced rapidly in image understanding, OCR, multi-image reasoning, and video QA, but they still struggle with genuine 3D spatial understanding. Reasoning about spatial relationships between objects, estimating camera movement, and localizing objects across views all require large-scale, fine-grained, geometrically constrained 3D data, a resource that has been scarce and expensive to produce.

Traditional approaches rely on manually annotated 3D datasets like ScanNet and ScanNet++, which are limited in scale and domain coverage. Holi-Spatial addresses this bottleneck by turning publicly available video data into structured spatial supervision automatically.

Three-Stage Automated Pipeline

Holi-Spatial operates through a three-stage pipeline:

Stage 1: Geometric Optimization, using 3D Gaussian Splatting for multi-view consistent depth and point cloud recovery.
Stage 2: Open-Vocabulary Perception, using VLM-generated categories and SAM3 segmentation masks back-projected into 3D.
Stage 3: Scene-Level Refinement, including multi-view merging, confidence filtering, VLM agent verification, and QA generation.
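To make the back-projection step in Stage 2 concrete, here is a minimal sketch of lifting a 2D segmentation mask into 3D with a standard pinhole camera model. It is an illustration under stated assumptions, not the paper's implementation: the function names, the nested-list data layout, and the intrinsics parameters (fx, fy, cx, cy) are all hypothetical, and the actual pipeline may handle depth and camera poses differently.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


def backproject_mask(
    depth: Sequence[Sequence[float]],
    mask: Sequence[Sequence[bool]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point3D]:
    """Lift masked pixels to camera-frame 3D points (pinhole model).

    Hypothetical helper: depth[v][u] is metric depth along the optical
    axis, mask[v][u] marks pixels that belong to one segmented object.
    """
    points: List[Point3D] = []
    for v, (depth_row, mask_row) in enumerate(zip(depth, mask)):
        for u, (z, keep) in enumerate(zip(depth_row, mask_row)):
            if not keep or z <= 0.0:
                continue  # skip background pixels and invalid depth
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


def camera_to_world(
    points: Sequence[Point3D],
    rotation: Sequence[Sequence[float]],
    translation: Sequence[float],
) -> List[Point3D]:
    """Apply X_world = R * X_cam + t, with R given as three rows."""
    world: List[Point3D] = []
    for p in points:
        wx, wy, wz = (
            sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
            for i in range(3)
        )
        world.append((wx, wy, wz))
    return world


# Toy 2x2 frame: two masked pixels at depth 2.0, unit focal length.
depth = [[2.0, 2.0], [2.0, 2.0]]
mask = [[True, False], [False, True]]
cam_points = backproject_mask(depth, mask, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
world_points = camera_to_world(cam_points, identity, [1.0, 0.0, 0.0])
```

Repeating this for every frame in which an object is segmented yields per-view 3D point sets in a shared world frame, which is the kind of input a multi-view merging step like the one in Stage 3 would need.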

The pipeline produced Holi-Spatial-4M, a dataset containing over 4 million spatial annotations spanning 3D grounding, spatial QA, instance segmentation, and 3D detection across ScanNet, ScanNet++, and DL3DV-10K sources.

Performance and Impact

Reported results on ScanNet++ are strong: depth F1 reaches 0.89, 2D segmentation IoU reaches 0.64, and 3D detection AP25/AP50 reaches 81.06/70.05. Fine-tuning Qwen3-VL-8B on the dataset raises 3D grounding AP50 from 13.50 to 27.98, a gain of 14.48 AP points.
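For readers unfamiliar with the detection metrics, AP25 and AP50 count a predicted 3D box as correct when its intersection-over-union (IoU) with a ground-truth box reaches 0.25 or 0.50 respectively. The sketch below computes IoU for axis-aligned boxes and applies both thresholds. It is illustrative only: the box layout and function names are assumptions, and the paper's evaluation may use oriented boxes or a different matching protocol.

```python
from typing import Tuple

# Assumed layout: (xmin, ymin, zmin, xmax, ymax, zmax)
Box = Tuple[float, float, float, float, float, float]


def box_volume(box: Box) -> float:
    """Volume of an axis-aligned box; degenerate boxes have volume 0."""
    return (
        max(0.0, box[3] - box[0])
        * max(0.0, box[4] - box[1])
        * max(0.0, box[5] - box[2])
    )


def iou_3d(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned 3D boxes."""
    intersection = 1.0
    for axis in range(3):
        low = max(a[axis], b[axis])
        high = min(a[axis + 3], b[axis + 3])
        if high <= low:
            return 0.0  # no overlap along this axis
        intersection *= high - low
    union = box_volume(a) + box_volume(b) - intersection
    return intersection / union if union > 0.0 else 0.0


def is_match(pred: Box, gt: Box, threshold: float) -> bool:
    """True if the prediction counts as a hit at this IoU threshold."""
    return iou_3d(pred, gt) >= threshold


# Two 2x2x2 boxes offset by 1 along x: intersection 4, union 12.
pred = (0.0, 0.0, 0.0, 2.0, 2.0, 2.0)
gt = (1.0, 0.0, 0.0, 3.0, 2.0, 2.0)
overlap = iou_3d(pred, gt)
hit_at_25 = is_match(pred, gt, 0.25)
hit_at_50 = is_match(pred, gt, 0.50)
```

In this toy case the IoU is one third, so the prediction counts toward AP25 but not AP50, which is why AP50 figures are typically lower than AP25 figures for the same detector.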

Holi-Spatial demonstrates that raw video can be automatically converted into structured, trainable spatial intelligence data, suggesting that future improvements in spatial AI may come as much from better data systems as from larger models. The approach is directly relevant to embodied AI, AR/VR, robot navigation, and scene understanding.

Paper: https://arxiv.org/abs/2603.07660 | Project: https://visionary-laboratory.github.io/holi-spatial/ | Code: https://github.com/Visionary-Laboratory/Holi-Spatial
