Towards Foundation Models for 3D Vision: How Close are We?

Bibliographic Details
Published in: Proceedings (International Conference on 3D Vision. Online) 3DV, pp. 1285-1296
Main Authors: Zuo, Yiming; Kayan, Karhan; Wang, Maggie; Jeon, Kevin; Deng, Jia; Griffiths, Thomas L.
Format: Conference Proceeding
Language: English
Published: IEEE, 25.03.2025
ISSN: 2475-7888
DOI: 10.1109/3DV66043.2025.00122

Summary: Building a foundation model for 3D vision is a complex challenge that remains unsolved. Towards that goal, it is important to understand the 3D reasoning capabilities of current models and to identify the gaps between these models and humans. Therefore, we construct a new 3D visual understanding benchmark named UniQA-3D. UniQA-3D covers fundamental 3D vision tasks in the Visual Question Answering (VQA) format. We evaluate state-of-the-art Vision-Language Models (VLMs), specialized models, and human subjects on it. Our results show that VLMs generally perform poorly, while the specialized models are accurate but not robust, failing under geometric perturbations. In contrast, human vision continues to be the most reliable 3D visual system. We further demonstrate that neural networks align more closely with human 3D vision mechanisms than classical computer vision methods do, and that Transformer-based networks such as ViT [17] align more closely than CNNs. We hope our study will benefit the future development of foundation models for 3D vision. Code is available at https://github.com/princeton-vl/UniQA-3D.