NVIDIA Built an AI Agent That Reasons About 3D Space, Without Any Training

NVIDIA Research just released SpatialClaw, and the core idea is deceptively simple: instead of training a model to understand 3D space, let the agent write code that calls vision tools to figure it out.

No extra training. No fine-tuning. Just a general language model, a set of off-the-shelf perception tools, and code as the interface between them.

The result: 59.9% average accuracy across 20 spatial benchmarks, 11.2 percentage points better than the previous best spatial agent.

The Problem SpatialClaw Is Solving

Vision-language models are surprisingly bad at 3D spatial reasoning. Ask one "which object is to the left of the blue cube?" and it may get it right some of the time. Ask something more complex, involving depth, relative positions, or multiple objects, and accuracy drops fast.

The standard fix is to train a specialized model on lots of 3D spatial data. That works, but it's expensive, slow, and the resulting model doesn't generalize well outside its training distribution.

SpatialClaw takes a different approach: don't train for spatial reasoning at all. Instead, give the agent tools that can perceive depth and segment objects, and let it write code to combine those outputs into an answer.

Code as the Action Interface

The key architectural decision in SpatialClaw is treating code as the "action interface": the way the agent interacts with the world.

When the agent needs to reason about a 3D scene, it doesn't try to answer directly. It writes Python code that calls depth estimation tools (Depth Anything 3) and segmentation tools (SAM 3), processes the outputs, and derives the spatial relationship from the data.

This is a meaningful shift from how most agents work. Most agents either call a tool and get back a text result, or produce a direct answer. SpatialClaw's agent writes code that orchestrates multiple tool calls and synthesizes the results programmatically. The code itself becomes the reasoning step.
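
To make "the code itself becomes the reasoning step" concrete, here is a minimal sketch of the kind of script such an agent might write. It is not SpatialClaw's actual code. The depth map and masks are hand-written stand-ins for what a depth estimator and a segmenter would return, and every name in it (`DEPTH`, `MASKS`, `compare`, and the helpers) is hypothetical.

```python
# Hypothetical stand-ins for perception tool outputs. A real system would get
# these from a depth estimator and a segmenter; the values here are made up.
DEPTH = [  # metres from the camera, one value per pixel (4 rows x 6 columns)
    [5.0, 5.0, 5.0, 5.0, 5.0, 5.0],
    [5.0, 2.0, 2.0, 5.0, 3.5, 3.5],
    [5.0, 2.0, 2.0, 5.0, 3.5, 3.5],
    [5.0, 5.0, 5.0, 5.0, 5.0, 5.0],
]
MASKS = {  # object name -> set of (row, col) pixels covered by that object
    "red_ball": {(1, 1), (1, 2), (2, 1), (2, 2)},
    "blue_cube": {(1, 4), (1, 5), (2, 4), (2, 5)},
}


def centroid_column(mask):
    """Mean image column of a mask; a smaller value means further left."""
    return sum(col for _, col in mask) / len(mask)


def median_depth(mask, depth):
    """Median depth over the mask, which tolerates a few bad pixels."""
    values = sorted(depth[row][col] for row, col in mask)
    mid = len(values) // 2
    if len(values) % 2:
        return values[mid]
    return (values[mid - 1] + values[mid]) / 2


def compare(a, b, masks=MASKS, depth=DEPTH):
    """Derive two spatial relations between objects a and b from tool outputs."""
    return {
        "a_left_of_b": centroid_column(masks[a]) < centroid_column(masks[b]),
        "a_closer_than_b": median_depth(masks[a], depth) < median_depth(masks[b], depth),
    }


result = compare("red_ball", "blue_cube")
print(result)
```

The answer comes from arithmetic on measured values rather than from the model's impression of the image, so the language model only has to get the program right.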

What It Runs On

One of the more surprising aspects of SpatialClaw: it works across a wide range of base models, from 26B to 397B parameters. The researchers tested it on Qwen3.5, Qwen3.6, and Gemma4 variants.

This matters because it means the spatial reasoning capability doesn't live in the model weights; it lives in the framework. Swap in a better base model and you get better spatial reasoning for free. The agent architecture is the durable part.
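
One way to picture "the capability lives in the framework" is a harness that stays fixed while the code-writing model is swapped. The sketch below is an illustration under assumptions, not SpatialClaw's implementation: `run_agent`, `estimate_depth`, `model_a`, and `model_b` are all hypothetical, the "models" are plain functions returning canned code, and the depth values are invented.

```python
from typing import Callable, Dict


# Hypothetical tool. A real perception tool would run a model on an image;
# this one returns fixed values so the sketch runs on its own.
def estimate_depth(obj: str) -> float:
    return {"chair": 1.8, "table": 3.1}[obj]


TOOLS: Dict[str, Callable] = {"estimate_depth": estimate_depth}


def run_agent(write_code: Callable[[str], str], question: str) -> object:
    """Ask the model for code, run it with the tools in scope, return `answer`."""
    source = write_code(question)
    namespace = dict(TOOLS)  # the tools are the only names handed to the code
    exec(source, namespace)  # a real system would sandbox this step
    return namespace["answer"]


# Two stand-in "models" that write different code for the same question.
def model_a(question: str) -> str:
    return (
        "answer = 'chair' if estimate_depth('chair') < estimate_depth('table') "
        "else 'table'\n"
    )


def model_b(question: str) -> str:
    return (
        "depths = {name: estimate_depth(name) for name in ('chair', 'table')}\n"
        "answer = min(depths, key=depths.get)\n"
    )


question = "Which is closer to the camera, the chair or the table?"
answers = [run_agent(model, question) for model in (model_a, model_b)]
print(answers)
```

Neither the harness nor the tools change when the model does. The only thing that varies is the quality of the code that comes back.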

The benchmarks back this up. SpatialClaw with a medium-sized model outperforms specialized spatial agents that were purpose-built and fine-tuned specifically for these tasks.

How the Numbers Stack Up

On 20 spatial reasoning benchmarks:

  • 59.9% average accuracy for SpatialClaw
  • 48.7% for SpaceTools (the previous best spatial agent)
  • 53.4% for structured tool calling without code
  • 56.7% for specialized fine-tuned models

That's a clean win across the board, and again, with no task-specific training.

The gap vs. fine-tuned models is particularly notable. SpatialClaw beats models that were specifically trained for spatial tasks, using only a general model plus external tools at inference time.
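
For readers who want the margins spelled out, the short script below subtracts each baseline from SpatialClaw's average. It uses only the four figures quoted above; the variable names are just for illustration.

```python
# Average accuracy (%) across the 20 benchmarks, as quoted in this post.
scores = {
    "SpatialClaw": 59.9,
    "SpaceTools": 48.7,
    "structured tool calling without code": 53.4,
    "specialized fine-tuned models": 56.7,
}

baseline = scores["SpatialClaw"]

# Gap in percentage points between SpatialClaw and each other approach.
gaps = {
    name: round(baseline - value, 1)
    for name, value in scores.items()
    if name != "SpatialClaw"
}

for name, gap in gaps.items():
    print(f"SpatialClaw vs. {name}: +{gap} points")
```

The gaps come out to 11.2 points over SpaceTools, 6.5 over structured tool calling without code, and 3.2 over specialized fine-tuned models.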

What This Means If You Use OpenClaw

SpatialClaw is a research paper, not a product you can plug in today. But it demonstrates something that's directly relevant to anyone building with AI agents.

The pattern of "use code as the action interface, call tools to gather information, synthesize programmatically" is how well-designed agents tend to handle complex tasks in any domain. You don't need a model that magically knows everything. You need an agent that knows how to use tools to find out what it doesn't know, and code is a powerful way to combine those outputs.

OpenClaw is built around the same intuition. When you wire up skills and tools in OpenClaw, you're giving the agent the ability to call out for information and combine it. That is the same architectural bet that NVIDIA just validated in a completely different domain.

The other takeaway: training-free approaches are getting very competitive with specialized fine-tuned models. That trend is good news for anyone building agent workflows. The general-purpose models you already use keep getting more capable, without requiring expensive retraining cycles.
