Research | June 20, 2026 | Embodied Global Team

NVIDIA Research Unveils SpatialClaw: Zero-Training Spatial Reasoning Through 'Code as Action'

NVIDIA Research unveils SpatialClaw, a zero-training approach that lets AI models generate Python code to combine perception tools for spatial reasoning, achieving 59.9% average accuracy across 20 benchmarks — outperforming existing methods by up to 11.2 percentage points.


NVIDIA Research has unveiled SpatialClaw, an approach that lets AI models perform spatial reasoning tasks without any additional training. Rather than issuing predefined tool calls, the model writes Python code that calls perception tools such as Depth Anything 3 and SAM 3 and freely combines their outputs.

The core innovation is a "code as action" interface. Traditional methods require models to call predefined tools with fixed outputs, which is rigid and makes the tools hard to combine. SpatialClaw instead lets models write Python code on the fly, using loops, conditionals, and any combination of perception tools to solve spatial problems.
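To make the idea concrete, here is a minimal sketch of a code-as-action loop. It is not SpatialClaw's implementation: `segment`, `estimate_depth`, `run_action`, and the `image` dictionary are hypothetical stand-ins that return canned data, and they do not reflect the real Depth Anything 3 or SAM 3 APIs. The `GENERATED` string plays the role of code a model might write for the question "Which is closer, the cup or the lamp?".

```python
from dataclasses import dataclass


@dataclass
class Mask:
    label: str
    pixels: list  # list of (row, col) coordinates covered by the mask


# Hypothetical perception tools (stubs, not the real SAM 3 / Depth Anything 3 APIs).
def segment(image, prompt):
    """Return the masks whose label matches the text prompt."""
    return [m for m in image["masks"] if m.label == prompt]


def estimate_depth(image):
    """Return a 2D list of per-pixel distances in metres."""
    return image["depth"]


TOOLS = {"segment": segment, "estimate_depth": estimate_depth}


def run_action(code, image):
    """Execute model-written code with the tools in scope and return `answer`."""
    namespace = dict(TOOLS, image=image)
    exec(code, namespace)  # a real system would run this in a sandbox
    return namespace["answer"]


# Code a model might emit: it loops over objects, skips any that were not
# found, and combines segmentation with depth to compare mean distances.
GENERATED = '''
depth = estimate_depth(image)
mean_depth = {}
for name in ["cup", "lamp"]:
    masks = segment(image, name)
    if not masks:
        continue
    values = [depth[r][c] for m in masks for (r, c) in m.pixels]
    mean_depth[name] = sum(values) / len(values)
answer = min(mean_depth, key=mean_depth.get)
'''

image = {
    "depth": [[1.0, 1.2, 4.0], [1.1, 1.3, 4.2]],
    "masks": [Mask("cup", [(0, 0), (1, 0)]), Mask("lamp", [(0, 2), (1, 2)])],
}
result = run_action(GENERATED, image)
print(result)
```

A fixed tool-calling interface would need a dedicated "compare depth of two objects" tool for this question. Here the composition lives in the generated program, which can also be read afterwards as a record of how the answer was reached.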

Key breakthroughs:

  • Zero-training required: The same prompts and toolset work across all tested backbone models (Qwen3.5/3.6 and Gemma4, ranging from 26B to 397B parameters)
  • 20 spatial reasoning benchmarks: Average accuracy of 59.9%, outperforming the strongest agent method SpaceTools by 11.2 percentage points, structured tool calling by 3.2 points, and no-tool baselines by 6.5 points
  • Code as interpretable reasoning: The generated code itself documents the reasoning process, making debugging straightforward
  • Fully open source: Code available on GitHub (NVlabs/SpatialClaw), requiring only compatible models and Depth Anything 3/SAM 3 perception tools
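The reported margins imply approximate average scores for each comparison method. The short calculation below derives them by subtraction. It assumes every margin is measured against the same 20-benchmark average of 59.9%, which the article does not state explicitly, so the results are implied figures and not reported ones.

```python
# Figures quoted in the article.
SPATIALCLAW_AVG = 59.9  # average accuracy in percent
MARGINS = {             # SpatialClaw's lead in percentage points
    "SpaceTools": 11.2,
    "structured tool calling": 3.2,
    "no-tool baseline": 6.5,
}

# Implied comparison average = SpatialClaw average minus the reported margin.
implied = {name: round(SPATIALCLAW_AVG - gap, 1) for name, gap in MARGINS.items()}

for name, avg in implied.items():
    print(f"{name}: {avg:.1f}%")
```

Under that assumption the implied averages are 48.7% for SpaceTools, 56.7% for structured tool calling, and 53.4% for the no-tool baseline.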

The benchmarks cover single-image understanding, multi-view reasoning, video and 4D analysis, and general spatial reasoning tasks. SpatialClaw led in nearly all categories, which suggests that the code interface generalizes better than fixed tool calls.

This work carries significant implications for embodied AI, where one of the biggest current bottlenecks is the efficiency of the "perception-reasoning-action" loop. SpatialClaw shows that redesigning the action interface, without increasing model parameters or training costs, can substantially improve spatial reasoning. For small and medium teams with limited GPU resources, this means they no longer need hundreds of GPUs to train models; a different approach to prompts and the action interface may be enough.

SpatialClaw represents a "less is more" demonstration for the entire embodied intelligence field, showing that smarter interaction design can outperform brute-force scaling.

Source: DeepTech, NVIDIA Research Blog, Toutiao