Bingyang Wang

SPACENUM: Revisiting Spatial Numerical Understanding in VLMs

May 22, 2026

Jianshu Zhang, Yijiang Li, Huifeixin Chen, Haoran Lu, Letian Xue, Bingyang Wang, Han Liu

Abstract: Vision-Language Models (VLMs) are increasingly deployed in embodied environments, where they need to produce numerical outputs such as action magnitudes and spatial coordinates. Although these numbers appear meaningful, it remains unclear whether they are genuinely grounded in spatial perception. In this work, we revisit spatial numerical understanding through SpaceNum, a unified framework that captures two complementary settings: numbers as dynamic transitions during spatial exploration, and numbers as static layouts in spatial reasoning. We formulate two bidirectional tasks, Num2Space and Space2Num, to evaluate how well VLMs map between vision-side spatial structure and language-side numerical representations. We systematically study whether current VLMs truly understand numerical values in spatial settings. Across dynamic transitions and static layouts, we find that models largely fail to ground numbers in spatial meaning and often perform close to random guessing. Through error analysis, reasoning trace analysis, and controlled interventions, we show that current VLMs rely heavily on shallow spatial cues, struggle to build stable coordinate-aware representations, and fail to abstract structured spatial layouts from visual observations. We further show that explicit reasoning provides only marginal gains, while tuning can partially improve spatial numerical understanding and transfer to external spatial reasoning benchmarks.

* Project page: https://sterzhang.github.io/SpaceNum-Home

Can Vision Language Models Infer Human Gaze Direction? A Controlled Study

Jun 04, 2025

Zory Zhang, Pinyuan Feng, Bingyang Wang, Tianwei Zhao, Suyang Yu, Qingying Gao, Hokin Deng, Ziqiao Ma, Yijiang Li, Dezhi Luo

Abstract: Gaze-referential inference, the ability to infer what others are looking at, is a critical component of a theory of mind that underpins natural human-AI interaction. In a controlled study, we evaluated this skill across 111 Vision Language Models (VLMs) using photos taken with manipulated difficulty and variability, compared their performance with that of human participants (N = 65), and analyzed behaviors using mixed-effects models. We found that 94 of the 111 VLMs failed to do better than random guessing, while humans achieved near-ceiling accuracy. VLMs even respond with each choice almost equally frequently. Are they randomly guessing? Although most VLMs struggle, when we zoom in on five of the top-tier VLMs with above-chance performance, we find that their performance declined with increasing task difficulty but varied only slightly across different prompts and scene objects. These behavioral features cannot be explained by treating the models as random guessers. Instead, they likely use a combination of heuristics and guessing, such that their performance is sensitive to task difficulty but robust to perceptual variations. This suggests that VLMs, lacking gaze inference capability, have yet to become technologies that can naturally interact with humans, but the potential remains.

* Preprint under review. Project page at https://grow-ai-like-a-child.github.io/gaze/
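
The abstract's "better than random guessing" criterion can be illustrated with a one-sided exact binomial test. This is only a minimal sketch of the idea, not the authors' analysis (the abstract states they used mixed-effects models); the function names, the number of answer choices, the trial counts, and the significance level below are all hypothetical.

```python
from math import comb


def binomial_tail(successes: int, trials: int, chance: float) -> float:
    """Return P(X >= successes) for X ~ Binomial(trials, chance)."""
    return sum(
        comb(trials, k) * chance**k * (1.0 - chance) ** (trials - k)
        for k in range(successes, trials + 1)
    )


def beats_chance(successes: int, trials: int, n_choices: int,
                 alpha: float = 0.05) -> bool:
    """True if accuracy is significantly above uniform random guessing."""
    p_value = binomial_tail(successes, trials, 1.0 / n_choices)
    return p_value < alpha


# Hypothetical example: 100 multiple-choice trials with 4 options each,
# so a pure guesser is expected to get about 25 right.
for correct in (30, 40):
    print(correct, beats_chance(correct, trials=100, n_choices=4))
```

A test like this only asks whether overall accuracy exceeds chance; it cannot by itself separate heuristic use from guessing, which is why the abstract also looks at how performance changes with task difficulty.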

Machine Psychophysics: Cognitive Control in Vision-Language Models

May 25, 2025

Dezhi Luo, Maijunxian Wang, Bingyang Wang, Tianwei Zhao, Yijiang Li, Hokin Deng

Abstract: Cognitive control refers to the ability to flexibly coordinate thought and action in pursuit of internal goals. A standard method for assessing cognitive control involves conflict tasks that contrast congruent and incongruent trials, measuring the ability to prioritize relevant information while suppressing interference. We evaluate 108 vision-language models on three classic conflict tasks and their more demanding "squared" variants across 2,220 trials. Model performance corresponds closely to human behavior under resource constraints and reveals individual differences. These results indicate that some form of human-like executive function has emerged in current multi-modal foundational models.

EffOWT: Transfer Visual Language Models to Open-World Tracking Efficiently and Effectively

Apr 09, 2025

Bingyang Wang, Kaer Huang, Bin Li, Yiqiang Yan, Lihe Zhang, Huchuan Lu, You He

Abstract: Open-World Tracking (OWT) aims to track every object of any category, which requires the model to have strong generalization capabilities. Trackers can improve their generalization ability by leveraging Visual Language Models (VLMs). However, challenges arise with the fine-tuning strategies when VLMs are transferred to OWT: full fine-tuning results in excessive parameter and memory costs, while the zero-shot strategy leads to sub-optimal performance. To solve this problem, EffOWT is proposed for efficiently transferring VLMs to OWT. Specifically, we build a small and independent learnable side network outside the VLM backbone. By freezing the backbone and only executing backpropagation on the side network, the model's efficiency requirements can be met. In addition, EffOWT enhances the side network with a hybrid structure of Transformer and CNN to improve the model's performance in the OWT field. Finally, we implement sparse interactions on the MLP, thus reducing parameter updates and memory costs significantly. Thanks to the proposed methods, EffOWT achieves an absolute gain of 5.5% on the tracking metric OWTA for unknown categories, while only updating 1.3% of the parameters compared to full fine-tuning, with a 36.4% memory saving. Other metrics also show clear improvement.

* 11 pages, 5 figures
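
The idea of freezing the backbone and backpropagating only through a side network can be sketched with a toy model. This is an illustration under strong assumptions, not EffOWT itself: the "backbone" here is a fixed two-feature map, the "side network" is a pair of scalar weights, and all names and numbers are hypothetical. The hybrid Transformer-CNN structure and the sparse MLP interactions from the abstract are not modeled.

```python
from typing import List, Sequence, Tuple


def backbone_features(x: float, frozen_w: Tuple[float, float]) -> List[float]:
    """Frozen feature extractor: its weights are read but never updated."""
    return [frozen_w[0] * x, frozen_w[1] * x * x]


def train_side(xs: Sequence[float], ys: Sequence[float],
               frozen_w: Tuple[float, float],
               steps: int = 500, lr: float = 0.2) -> List[float]:
    """Fit only the side weights by gradient descent on mean squared error."""
    side_w = [0.0, 0.0]
    n = len(xs)
    for _ in range(steps):
        grad = [0.0, 0.0]
        for x, y in zip(xs, ys):
            feats = backbone_features(x, frozen_w)
            err = side_w[0] * feats[0] + side_w[1] * feats[1] - y
            # Gradients are accumulated for the side weights only.
            grad[0] += 2.0 * err * feats[0] / n
            grad[1] += 2.0 * err * feats[1] / n
        side_w[0] -= lr * grad[0]
        side_w[1] -= lr * grad[1]
    return side_w


# Hypothetical data: the target is x + x^2, seen through frozen features.
frozen = (0.5, 2.0)
inputs = [-1.0, -0.5, 0.0, 0.5, 1.0]
targets = [x + x * x for x in inputs]
print(train_side(inputs, targets, frozen), frozen)
```

Because no gradient is ever computed for `frozen`, the update cost scales with the size of the side module alone, which is the efficiency argument the abstract makes.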

Scanning phase imaging without accurate positioning system

Oct 31, 2023

Tao Liu, Bingyang Wang, JiangTao Zhao, Fu rong Chen, Fucai Zhang

Abstract: Ptychography, a high-resolution phase imaging technique using precise in-plane translation information, has been widely applied at modern synchrotron radiation sources across the globe. A key requirement for successful ptychographic reconstruction is precise knowledge of the scanning positions, which are typically obtained by a physical interferometric positioning system. However, high-throughput positioning poses an engineering challenge, especially at the nanometer scale or below. In this work, we propose a novel scanning imaging framework that does not require any prior position information from the positioning system. Specifically, our scheme uses a wavefront modulation mechanism to simultaneously reconstruct the object functions at each scan position and the shared illumination function. The scanning trajectory is then extracted by our sub-pixel image registration algorithm from the overlap regions of the reconstructed object functions, and a complete object function is obtained by assembling the reconstructed parts. High-quality imaging of a biological sample and position recovery with sub-pixel accuracy are demonstrated in a proof-of-concept experiment. Based on these results, the method may have great potential for high-resolution, high-throughput phase imaging.

* 9 pages, 4 figures
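
The abstract does not describe the registration algorithm itself. As a generic illustration of how a sub-pixel offset can be recovered from overlapping reconstructions, the sketch below uses a swapped-in, standard technique: cross-correlation followed by a parabolic fit around the peak, in one dimension. The function names and the Gaussian test profiles are hypothetical and stand in for overlapping object patches.

```python
import math
from typing import List, Sequence


def gaussian_profile(center: float, length: int = 64,
                     sigma: float = 3.0) -> List[float]:
    """Hypothetical 1-D test signal standing in for an object patch."""
    return [math.exp(-((i - center) ** 2) / (2.0 * sigma**2))
            for i in range(length)]


def cross_correlation(a: Sequence[float], b: Sequence[float],
                      max_lag: int) -> List[float]:
    """c[k] = sum_i a[i] * b[i + k] for k in [-max_lag, max_lag]."""
    out = []
    for k in range(-max_lag, max_lag + 1):
        total = 0.0
        for i in range(len(a)):
            j = i + k
            if 0 <= j < len(b):
                total += a[i] * b[j]
        out.append(total)
    return out


def estimate_shift(a: Sequence[float], b: Sequence[float],
                   max_lag: int = 10) -> float:
    """Estimate s such that b[i] is approximately a[i - s]."""
    corr = cross_correlation(a, b, max_lag)
    peak = max(range(len(corr)), key=corr.__getitem__)
    lag = float(peak - max_lag)
    # Refine the integer peak with a parabola through its two neighbors.
    if 0 < peak < len(corr) - 1:
        left, mid, right = corr[peak - 1], corr[peak], corr[peak + 1]
        denom = left - 2.0 * mid + right
        if denom != 0.0:
            lag += 0.5 * (left - right) / denom
    return lag


reference = gaussian_profile(center=30.0)
shifted = gaussian_profile(center=32.3)
print(estimate_shift(reference, shifted))
```

The parabolic refinement is slightly biased for non-parabolic peaks, so the estimate is close to, but not exactly, the true offset; a two-dimensional version would apply the same refinement along each axis.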
