Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yan Gang

PhysQuantAgent: An Inference Pipeline of Mass Estimation for Vision-Language Models

Mar 17, 2026

Hisayuki Yokomizo, Taiki Miyanishi, Yan Gang, Shuhei Kurita, Nakamasa Inoue, Yusuke Iwasawa

Abstract:Vision-Language Models (VLMs) are increasingly applied to robotic perception and manipulation, yet their ability to infer physical properties required for manipulation remains limited. In particular, estimating the mass of real-world objects is essential for determining appropriate grasp force and ensuring safe interaction. However, current VLMs lack reliable mass reasoning capabilities, and most existing benchmarks do not explicitly evaluate physical quantity estimation under realistic sensing conditions. In this work, we propose PhysQuantAgent, a framework for real-world object mass estimation using VLMs, together with VisPhysQuant, a new benchmark dataset for evaluation. VisPhysQuant consists of RGB-D videos of real objects captured from multiple viewpoints, annotated with precise mass measurements. To improve estimation accuracy, we introduce three visual prompting methods that enhance the input image with object detection, scale estimation, and cross-sectional image generation to help the model comprehend the size and internal structure of the target object. Experiments show that visual prompting significantly improves mass estimation accuracy on real-world data, suggesting the efficacy of integrating spatial reasoning with VLM knowledge for physical inference.

* Code and dataset will be available at https://github.com/hisasnow/PhysQuantAgent

Via

Access Paper or Ask Questions

Tailoring Artificial Neural Networks for Optimal Learning

Jul 08, 2017

Pau Vilimelis Aceituno, Yan Gang, Yang-Yu Liu

Figure 1 for Tailoring Artificial Neural Networks for Optimal Learning

Figure 2 for Tailoring Artificial Neural Networks for Optimal Learning

Figure 3 for Tailoring Artificial Neural Networks for Optimal Learning

Figure 4 for Tailoring Artificial Neural Networks for Optimal Learning

Abstract:As one of the most important paradigms of recurrent neural networks, the echo state network (ESN) has been applied to a wide range of fields, from robotics to medicine to finance, and language processing. A key feature of the ESN paradigm is its reservoir ---a directed and weighted network--- that represents the connections between neurons and projects the input signals into a high dimensional space. Despite extensive studies, the impact of the reservoir network on the ESN performance remains unclear. Here we systematically address this fundamental question. Through spectral analysis of the reservoir network we reveal a key factor that largely determines the ESN memory capacity and hence affects its performance. Moreover, we find that adding short loops to the reservoir network can tailor ESN for specific tasks and optimal learning. We validate our findings by applying ESN to forecast both synthetic and real benchmark time series. Our results provide a new way to design task-specific recurrent neural networks, as well as new insights in understanding complex networked systems.

* 37 pages, 7 figures

Via

Access Paper or Ask Questions