The Hidden Bottleneck in Embodied AI
Multimodal Large Language Models (MLLMs) have demonstrated remarkable proficiency in using digital APIs — booking flights, querying databases, and navigating the web. However, a new benchmark reveals a startling gap when these models are asked to interact with the physical world.
PhysTool-Bench was introduced by researchers from Singapore Management University and The Hong Kong Polytechnic University in a paper posted to arXiv (2606.10803) on June 9, 2026. It is presented as the first comprehensive benchmark for evaluating MLLMs' ability to recognize, select, and plan the use of physical tools in real-world scenarios.
The Benchmark
PhysTool-Bench comprises 2,510 queries over 2,678 real-world physical tools spanning diverse domains including manufacturing, electrical work, agriculture, and healthcare. Models are evaluated along two primary dimensions:
- Task I - Physical Tool Recognition: Identifying all tools present in a scene
- Task II - Tool Selection and Action Planning: Selecting the correct tools and placing them in the right execution order based on an instruction
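To make the two tasks concrete, here is a minimal sketch of how such outputs could be scored. The metric definitions (set-based recall for Task I, exact ordered match for Task II), the function names, and the example tools are illustrative assumptions, not the scoring protocol from the paper.

```python
from typing import Sequence


def recognition_recall(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Task I (assumed metric): share of the scene's tools that the model named."""
    gold_set = {tool.lower() for tool in gold}
    if not gold_set:
        return 1.0  # nothing to find, so nothing was missed
    predicted_set = {tool.lower() for tool in predicted}
    return len(gold_set & predicted_set) / len(gold_set)


def plan_is_correct(predicted: Sequence[str], gold: Sequence[str]) -> bool:
    """Task II (assumed metric): same tools, in the same execution order."""
    return [tool.lower() for tool in predicted] == [tool.lower() for tool in gold]


# Hypothetical scene and model outputs, for illustration only.
scene_tools = ["hammer", "nail", "pliers", "wire stripper"]
model_seen = ["hammer", "pliers", "screwdriver"]
recall = recognition_recall(model_seen, scene_tools)  # 2 of 4 gold tools found

gold_plan = ["wire stripper", "pliers"]
model_plan = ["pliers", "wire stripper"]  # right tools, wrong order
ok = plan_is_correct(model_plan, gold_plan)

print(recall, ok)
```

Under these assumptions a model gets partial credit for recognition, while a plan counts only when both the tool choice and the ordering are right, which is one reason an end-to-end score can sit well below a recognition score.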
Findings: A Two-Level Deficit
Across 13 leading MLLMs, the results were sobering:
| Model | Tool Recognition (Task I) | End-to-End Completion (Task II) |
|---|---|---|
| Gemini-3.1-Pro (best) | 58.7% | 21.0% |
| GPT-5.4 | ~52% | ~16% |
| Claude 4.5 Opus | ~48% | ~14% |
| Other models | 30-50% | 5-15% |
Even the strongest model, Gemini-3.1-Pro, scored only 58.7% on tool recognition and completed roughly one-fifth (21.0%) of end-to-end queries.
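The size of that gap can be worked out directly from the two Gemini-3.1-Pro figures in the table. The short sketch below does only that arithmetic; the variable names are illustrative, and because the two tasks are scored on different criteria, the ratio is indicative rather than a true conditional success rate.

```python
# Scores reported in the table for the strongest model (percent).
recognition = 58.7  # Task I
completion = 21.0   # Task II, end to end

# Points short of perfect recognition.
recognition_shortfall = round(100.0 - recognition, 1)
# Points lost between the recognition score and the end-to-end score.
stage_drop = round(recognition - completion, 1)
# End-to-end score expressed as a share of the recognition score.
retained = round(completion / recognition, 3)

print(recognition_shortfall, stage_drop, retained)
```

In other words, the end-to-end score is a little over a third of the recognition score for the best-performing model.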
The Core Problem: Functional Commonsense
The researchers' analysis reveals two distinct deficits:
- Perception Deficit: MLLMs struggle to perceive tools in realistic, cluttered scenes, though this is the smaller of the two gaps
- Functional Commonsense Deficit: The far larger drop occurs at the planning stage, where models fail to map perceived tools onto task semantics. Even when models correctly see a hammer, they may not understand it's the right tool for driving a nail.
This "functional commonsense" gap — the ability to connect visual recognition with practical task semantics — is identified as the central bottleneck for practical embodied AI deployment.
Implications for Embodied AI
While MLLMs increasingly serve as the "brain" of embodied AI systems, instructing robots to interact with the physical world, this research shows that the path from digital tool mastery to physical world competence is far from complete. The findings suggest that future embodied AI research must focus not just on larger models or more data, but on bridging this fundamental "functional commonsense" gap — teaching AI systems to understand not just what tools look like, but what they do and how they should be used in context.
Paper: arXiv:2606.10803 - "Beyond APIs: Probing the Limits of MLLMs in Physical Tool Use"
Authors: Zhixin Ma, Yutong Zhou, Yongqi Li, Chong-Wah Ngo, Wenjie Li
