Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Chencheng Tang

PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models

Apr 22, 2025

Shi Qiu, Shaoyang Guo, Zhuo-Yang Song, Yunbo Sun, Zeyu Cai, Jiashen Wei, Tianyu Luo, Yixuan Yin, Haoxu Zhang, Yi Hu(+42 more)

Abstract:We introduce PHYBench, a novel, high-quality benchmark designed for evaluating reasoning capabilities of large language models (LLMs) in physical contexts. PHYBench consists of 500 meticulously curated physics problems based on real-world physical scenarios, designed to assess the ability of models to understand and reason about realistic physical processes. Covering mechanics, electromagnetism, thermodynamics, optics, modern physics, and advanced physics, the benchmark spans difficulty levels from high school exercises to undergraduate problems and Physics Olympiad challenges. Additionally, we propose the Expression Edit Distance (EED) Score, a novel evaluation metric based on the edit distance between mathematical expressions, which effectively captures differences in model reasoning processes and results beyond traditional binary scoring methods. We evaluate various LLMs on PHYBench and compare their performance with human experts. Our results reveal that even state-of-the-art reasoning models significantly lag behind human experts, highlighting their limitations and the need for improvement in complex physical reasoning scenarios. Our benchmark results and dataset are publicly available at https://phybench-official.github.io/phybench-demo/.

* 21 pages ,8 figures, 4 tables

Via

Access Paper or Ask Questions

Efficiently Obtaining Reachset Conformance for the Formal Analysis of Robotic Contact Tasks

Oct 14, 2024

Chencheng Tang, Matthias Althoff

Figure 1 for Efficiently Obtaining Reachset Conformance for the Formal Analysis of Robotic Contact Tasks

Figure 2 for Efficiently Obtaining Reachset Conformance for the Formal Analysis of Robotic Contact Tasks

Figure 3 for Efficiently Obtaining Reachset Conformance for the Formal Analysis of Robotic Contact Tasks

Figure 4 for Efficiently Obtaining Reachset Conformance for the Formal Analysis of Robotic Contact Tasks

Abstract:Formal verification of robotic tasks requires a simple yet conformant model of the used robot. We present the first work on generating reachset conformant models for robotic contact tasks considering hybrid (mixed continuous and discrete) dynamics. Reachset conformance requires that the set of reachable outputs of the abstract model encloses all previous measurements to transfer safety properties. Aiming for industrial applications, we describe the system using a simple hybrid automaton with linear dynamics. We inject non-determinism into the continuous dynamics and the discrete transitions, and we optimally identify all model parameters together with the non-determinism required to capture the recorded behaviors. Using two 3-DOF robots, we show that our approach can effectively generate models to capture uncertainties in system behavior and substantially reduce the required testing effort in industrial applications.

* Accepted at the 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2024)

Via

Access Paper or Ask Questions

Formal Verification of Robotic Contact Tasks via Reachability Analysis

Jul 26, 2023

Chencheng Tang, Matthias Althoff

Abstract:Verifying the correct behavior of robots in contact tasks is challenging due to model uncertainties associated with contacts. Standard methods for testing often fall short since all (uncountable many) solutions cannot be obtained. Instead, we propose to formally and efficiently verify robot behaviors in contact tasks using reachability analysis, which enables checking all the reachable states against user-provided specifications. To this end, we extend the state of the art in reachability analysis for hybrid (mixed discrete and continuous) dynamics subject to discrete-time input trajectories. In particular, we present a novel and scalable guard intersection approach to reliably compute the complex behavior caused by contacts. We model robots subject to contacts as hybrid automata in which crucial time delays are included. The usefulness of our approach is demonstrated by verifying safe human-robot interaction in the presence of constrained collisions, which was out of reach for existing methods.

* This work has been accepted by the 22nd IFAC World Congress (2023 in Yokohama, Japan)

Via

Access Paper or Ask Questions