Abstract: Vision-Language Models (VLMs) are increasingly applied to robotic perception and manipulation, yet their ability to infer the physical properties required for manipulation remains limited. In particular, estimating the mass of real-world objects is essential for determining appropriate grasp force and ensuring safe interaction. However, current VLMs lack reliable mass reasoning capabilities, and most existing benchmarks do not explicitly evaluate physical quantity estimation under realistic sensing conditions. In this work, we propose PhysQuantAgent, a framework for real-world object mass estimation using VLMs, together with VisPhysQuant, a new benchmark dataset for evaluation. VisPhysQuant consists of RGB-D videos of real objects captured from multiple viewpoints and annotated with precise mass measurements. To improve estimation accuracy, we introduce three visual prompting methods that augment the input image with object detection, scale estimation, and cross-sectional image generation, helping the model comprehend the size and internal structure of the target object. Experiments show that visual prompting significantly improves mass estimation accuracy on real-world data, suggesting that integrating spatial reasoning with VLM knowledge is effective for physical inference.




Abstract: As one of the most important paradigms of recurrent neural networks, the echo state network (ESN) has been applied to a wide range of fields, from robotics and medicine to finance and language processing. A key feature of the ESN paradigm is its reservoir, a directed and weighted network that represents the connections between neurons and projects the input signals into a high-dimensional space. Despite extensive studies, the impact of the reservoir network on ESN performance remains unclear. Here we systematically address this fundamental question. Through spectral analysis of the reservoir network, we reveal a key factor that largely determines the ESN memory capacity and hence affects its performance. Moreover, we find that adding short loops to the reservoir network can tailor the ESN for specific tasks and optimal learning. We validate our findings by applying the ESN to forecast both synthetic and real benchmark time series. Our results provide a new way to design task-specific recurrent neural networks, as well as new insights into complex networked systems.
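
For orientation, the reservoir and memory capacity mentioned above are conventionally formalized as follows. This is the standard textbook formulation, given here as an assumption; the abstract does not state which variant the authors use, and the symbols $W$, $W_{\mathrm{in}}$, $W_{\mathrm{out}}$, $x(t)$, $u(t)$, and $y_k(t)$ are illustrative notation rather than the paper's.

\[
\begin{aligned}
x(t) &= \tanh\bigl(W\,x(t-1) + W_{\mathrm{in}}\,u(t)\bigr), \\
y(t) &= W_{\mathrm{out}}\,x(t), \\
\mathrm{MC} &= \sum_{k=1}^{\infty} \mathrm{MC}_k, \qquad
\mathrm{MC}_k = \frac{\operatorname{cov}^2\bigl(u(t-k),\,y_k(t)\bigr)}{\operatorname{var}\bigl(u(t)\bigr)\,\operatorname{var}\bigl(y_k(t)\bigr)}.
\end{aligned}
\]

Here $W$ is the weighted adjacency matrix of the reservoir network, $W_{\mathrm{in}}$ maps the input $u(t)$ into the reservoir state $x(t)$, only the readout $W_{\mathrm{out}}$ is trained, and $y_k(t)$ is the output of a readout trained to reconstruct the input delayed by $k$ steps. In this formulation the spectrum of $W$ governs how quickly past inputs fade from $x(t)$, which is why a spectral analysis of the reservoir bears on memory capacity.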