SFI‑Bench finds multimodal LLMs fall short on spatial and functional reasoning

5/7/2026, 6:02:02 AM

A May 2026 paper introduces SFI‑Bench, a video‑centered evaluation with more than 1,700 question items designed to probe whether multimodal LLMs can reason about object roles and context‑dependent function, not just geometry.

A paper published in May 2026 presents the Spatial-Functional Intelligence Benchmark (SFI‑Bench), a video‑centered evaluation created to probe higher‑order spatial cognition in multimodal large language models (MLLMs). The authors frame SFI‑Bench as an effort to push models past purely geometric perception toward reasoning about what objects are for in context, not only where they are.

SFI‑Bench contains over 1,700 question items drawn from diverse egocentric indoor video scans and is built to exercise two complementary dimensions. Structured Spatial Reasoning measures a model’s ability to form coherent spatial representations and handle complex layouts; Functional Reasoning asks models to infer affordances and context‑dependent uses that determine how objects are employed in different situations. Task formats intentionally demand multi‑step integration of perception, memory, and inference. Items include conditional counting that depends on prior visual states, multi‑hop relational reasoning across frames, functional pairing to match objects with plausible uses, and knowledge‑grounded troubleshooting that requires linking observed scenes to external or commonsense knowledge.
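To make the two-dimension structure concrete, the sketch below shows one way results on such a benchmark could be tallied per dimension and per task type. It is an illustration only: the `Item` schema, the field names, the task labels, and exact-match scoring on answer labels are all assumptions, not the paper's actual data format or metric, and the records are made up.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """One hypothetical benchmark item (schema assumed, not from the paper)."""
    dimension: str   # "structured_spatial" or "functional"
    task: str        # e.g. "conditional_counting", "functional_pairing"
    answer: str      # gold answer label
    prediction: str  # model's answer label


def accuracy_by(items, key):
    """Return {group: fraction correct}, grouping items by the given attribute."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        group = getattr(item, key)
        total[group] += 1
        # Exact-match scoring is an assumption made for this sketch.
        correct[group] += int(item.prediction == item.answer)
    return {group: correct[group] / total[group] for group in total}


# Toy, made-up records for illustration only; not results from the paper.
items = [
    Item("structured_spatial", "conditional_counting", "B", "B"),
    Item("structured_spatial", "multi_hop_relational", "A", "C"),
    Item("functional", "functional_pairing", "D", "D"),
    Item("functional", "knowledge_grounded_troubleshooting", "A", "B"),
]

per_dimension = accuracy_by(items, "dimension")
per_task = accuracy_by(items, "task")
```

Reporting accuracy separately for each dimension and task, rather than as a single aggregate, is what would let a gap on one specific task family stand out.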

The authors situate SFI‑Bench alongside prior suites that emphasize geometric perception, noting gaps those benchmarks leave. For example, VSI‑Bench focuses on foundational spatial geometry, while earlier work such as SPACE (March 5, 2025) explored emergent spatial cognition in leading models. The new benchmark is designed to expose limits in context‑dependent function and multi‑step reasoning that prior tests do not systematically probe.

In the reported experiments, contemporary multimodal LLMs consistently underperform on tasks that require integrating long‑term spatial memory with functional and external knowledge, exposing a clear performance bottleneck. The shortfall has practical implications: systems intended to navigate, manipulate, or instruct in physical environments will likely need architectures and training strategies that explicitly fuse long‑term spatial memory, relational reasoning, and knowledge grounding to operate reliably.

The paper lists as authors Le Zhang, Jihan Yang, Soundarya Krishnan, Jimit Majmudar, Xiou Ge, Prasoon Puri, Prathamesh Saraf, Shruti Bhargava, Dhivya Piraviperumal, Yinan Ling, Cindy Pan, Hong Yu, Aishwarya Agrawal, and Bo‑Hsiang Tseng. Footnotes indicate affiliations with Mila (Université de Montréal) and New York University and note that some work was completed while an author was at Apple. The authors present SFI‑Bench as a measurement tool to track progress and to guide future model design toward more cognitively capable, grounded multimodal agents.

Sources

Apple Machine Learning Research · 5/6/2026
