A research team from Shanghai Artificial Intelligence Laboratory, Northwestern Polytechnical University (NWPU), and Shanghai Jiao Tong University (SJTU) has introduced Holi-Spatial, a fully automated framework for constructing 3D spatial intelligence training data from raw video streams. The paper has been accepted as an Oral presentation at ICML 2026.
The Data Bottleneck in Spatial Intelligence
Multimodal large language models have advanced rapidly in image understanding, OCR, multi-image reasoning, and video QA, yet they still struggle with genuine 3D spatial understanding. Reasoning about spatial relationships between objects, estimating camera motion, and localizing objects across views all require large-scale, fine-grained, geometrically constrained 3D data, a resource that has been scarce and expensive to produce.
Traditional approaches rely on manually annotated 3D datasets like ScanNet and ScanNet++, which are limited in scale and domain coverage. Holi-Spatial addresses this bottleneck by turning publicly available video data into structured spatial supervision automatically.
Three-Stage Automated Pipeline
Holi-Spatial operates through a three-stage pipeline:
Stage 1: Geometric Optimization, using 3D Gaussian Splatting for multi-view consistent depth and point cloud recovery.
Stage 2: Open-Vocabulary Perception, using VLM-generated categories and SAM3 segmentation masks back-projected into 3D.
Stage 3: Scene-Level Refinement, including multi-view merging, confidence filtering, VLM agent verification, and QA generation.
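To make the Stage 2 back-projection step concrete, here is a minimal sketch of lifting the pixels of one 2D segmentation mask into 3D points using a depth map. It is not the paper's implementation: the function name `backproject_mask`, the pinhole camera model, and the camera-to-world pose convention are all assumptions made for illustration.

```python
import numpy as np

def backproject_mask(depth, mask, K, cam_to_world):
    """Lift masked pixels into world-space 3D points (hypothetical helper).

    depth:        (H, W) depth along the camera z-axis, 0 where invalid
    mask:         (H, W) segmentation mask for one object
    K:            (3, 3) pinhole intrinsics (assumed camera model)
    cam_to_world: (4, 4) camera-to-world pose (assumed convention)
    """
    valid = mask.astype(bool) & (depth > 0)
    v, u = np.nonzero(valid)                 # pixel rows (v) and columns (u)
    z = depth[v, u].astype(float)
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    x = (u - cx) * z / fx                    # invert the pinhole projection
    y = (v - cy) * z / fy
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)  # homogeneous, (N, 4)
    pts_world = pts_cam @ cam_to_world.T     # apply the pose to every point
    return pts_world[:, :3]

# Toy example: a flat depth map, one masked pixel, camera shifted 1 unit along x.
K = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [0.0, 0.0, 1.0]])
depth = np.full((4, 4), 2.0)
mask = np.zeros((4, 4), dtype=bool)
mask[1, 3] = True
pose = np.eye(4)
pose[0, 3] = 1.0
print(backproject_mask(depth, mask, K, pose))  # one point at (3, 0, 2)
```

Running this per view and accumulating the points per object would give the raw material that a later multi-view merging step could consolidate.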
The pipeline produced Holi-Spatial-4M, a dataset containing over 4 million spatial annotations spanning 3D grounding, spatial QA, instance segmentation, and 3D detection across ScanNet, ScanNet++, and DL3DV-10K sources.
Performance and Impact
Experimental results demonstrate significant quality gains. On ScanNet++, depth F1 reaches 0.89, 2D segmentation IoU reaches 0.64, and 3D detection AP25/AP50 reaches 81.06/70.05. Fine-tuning Qwen3-VL-8B on the dataset raises its 3D grounding AP50 from 13.50 to 27.98, a gain of 14.48 points.
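For readers unfamiliar with the AP25/AP50 notation: by convention, a predicted 3D box counts as a correct detection when its intersection-over-union (IoU) with a ground-truth box is at least 0.25 or 0.5, respectively. The sketch below shows that matching test for a single box pair. It assumes axis-aligned boxes for simplicity, and the name `box_iou_3d` is hypothetical; the paper's evaluation code may differ, for example by using oriented boxes.

```python
import numpy as np

def box_iou_3d(a, b):
    """IoU of two axis-aligned boxes given as (xmin, ymin, zmin, xmax, ymax, zmax)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    lo = np.maximum(a[:3], b[:3])            # lower corner of the overlap region
    hi = np.minimum(a[3:], b[3:])            # upper corner of the overlap region
    inter = np.prod(np.clip(hi - lo, 0.0, None))  # zero if the boxes do not overlap
    vol_a = np.prod(a[3:] - a[:3])
    vol_b = np.prod(b[3:] - b[:3])
    return inter / (vol_a + vol_b - inter)

gt = (0, 0, 0, 2, 2, 2)      # ground-truth box, volume 8
pred = (1, 0, 0, 3, 2, 2)    # prediction shifted by half its width along x

iou = box_iou_3d(gt, pred)   # overlap 4, union 12, so IoU = 1/3
print(f"IoU = {iou:.3f}")
print("match at 0.25:", iou >= 0.25)  # True
print("match at 0.50:", iou >= 0.50)  # False
```

The example illustrates why AP50 is the stricter figure: a prediction offset by half a box width still passes the 0.25 threshold but fails at 0.5.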
Holi-Spatial demonstrates that raw video can be automatically converted into structured, trainable spatial intelligence data, suggesting that future improvements in spatial AI may come as much from better data systems as from larger models. This has direct implications for embodied AI, AR/VR, robot navigation, and scene understanding applications.
Paper: https://arxiv.org/abs/2603.07660 | Project: https://visionary-laboratory.github.io/holi-spatial/ | Code: https://github.com/Visionary-Laboratory/Holi-Spatial
