Zixu Zhang

On the Compositional Generalization of Multimodal LLMs for Medical Imaging

Dec 28, 2024

Zhenyang Cai, Junying Chen, Rongsheng Wang, Weihong Wang, Yonglin Deng, Dingjie Song, Yize Chen, Zixu Zhang, Benyou Wang

Abstract: Multimodal large language models (MLLMs) hold significant potential in the medical field, but their capabilities are often limited by insufficient data in certain medical domains, highlighting the need to understand what kinds of images MLLMs can use for generalization. Current research suggests that multi-task training outperforms single-task training because different tasks can benefit each other, but it often overlooks the internal relationships within these tasks, providing limited guidance on selecting datasets to enhance specific tasks. To analyze this phenomenon, we attempted to employ compositional generalization (CG), the ability of models to understand novel combinations by recombining learned elements, as a guiding framework. Medical images can be precisely defined by Modality, Anatomical area, and Task, which naturally provides an environment for exploring CG. We therefore assembled 106 medical datasets to create Med-MAT for comprehensive experiments. The experiments confirmed that MLLMs can use CG to understand unseen medical images and identified CG as one of the main drivers of the generalization observed in multi-task training. Further studies demonstrated that CG effectively supports datasets with limited data and delivers consistent performance across different backbones, highlighting its versatility and broad applicability. Med-MAT is publicly available at https://github.com/FreedomIntelligence/Med-MAT.

Versatile Scene-Consistent Traffic Scenario Generation as Optimization with Diffusion

Apr 03, 2024

Zhiyu Huang, Zixu Zhang, Ameya Vaidya, Yuxiao Chen, Chen Lv, Jaime Fernández Fisac

Abstract: Generating realistic and controllable agent behaviors in traffic simulation is crucial for the development of autonomous vehicles. This problem is often formulated as imitation learning (IL) from real-world driving data by either directly predicting future trajectories or inferring cost functions with inverse optimal control. In this paper, we draw a conceptual connection between IL and diffusion-based generative modeling and introduce a novel framework, Versatile Behavior Diffusion (VBD), to simulate interactive scenarios with multiple traffic participants. Our model not only generates scene-consistent multi-agent interactions but also enables scenario editing through multi-step guidance and refinement. Experimental evaluations show that VBD achieves state-of-the-art performance on the Waymo Sim Agents benchmark. In addition, we illustrate the versatility of our model by adapting it to various applications. VBD is capable of producing scenarios conditioned on priors, integrating with model-based optimization, sampling multi-modal scene-consistent scenarios by fusing marginal predictions, and generating safety-critical scenarios when combined with a game-theoretic solver.

Blending Data-Driven Priors in Dynamic Games

Feb 23, 2024

Justin Lidard, Haimin Hu, Asher Hancock, Zixu Zhang, Albert Gimó Contreras, Vikash Modi, Jonathan DeCastro, Deepak Gopinath, Guy Rosman, Naomi Leonard (+2 more)

Abstract: As intelligent robots like autonomous vehicles become increasingly deployed in the presence of people, the extent to which these systems should leverage model-based game-theoretic planners versus data-driven policies for safe, interaction-aware motion planning remains an open question. Existing dynamic game formulations assume all agents are task-driven and behave optimally. However, in reality, humans tend to deviate from the decisions prescribed by these models, and their behavior is better approximated under a noisy-rational paradigm. In this work, we investigate a principled methodology to blend a data-driven reference policy with an optimization-based game-theoretic policy. We formulate KLGame, a type of non-cooperative dynamic game with Kullback-Leibler (KL) regularization with respect to a general, stochastic, and possibly multi-modal reference policy. Our method incorporates, for each decision maker, a tunable parameter that permits modulation between task-driven and data-driven behaviors. We propose an efficient algorithm for computing multimodal approximate feedback Nash equilibrium strategies of KLGame in real time. Through a series of simulated and real-world autonomous driving scenarios, we demonstrate that KLGame policies incorporate guidance from the reference policy and account for noisily-rational human behaviors more effectively than non-regularized baselines.

* 19 pages, 11 figures
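
As a rough illustration of the tunable blending between task-driven and data-driven behavior described in the abstract above, the sketch below solves a much simpler problem: a single decision maker, a single step, and a finite action set. Minimizing expected cost plus `lam` times the KL divergence to a reference policy has the closed-form solution pi(a) proportional to ref(a) * exp(-cost(a) / lam). The function name `blend_policy`, the costs, and the reference probabilities are all hypothetical; this is not the KLGame algorithm, which addresses multi-player dynamic games.

```python
import math

def blend_policy(costs, ref_probs, lam):
    """Return argmin over pi of E_pi[cost] + lam * KL(pi || ref) for discrete actions.

    The minimizer is pi(a) proportional to ref(a) * exp(-cost(a) / lam).
    Assumes ref_probs is a valid distribution with at least one positive entry.
    """
    if lam <= 0:
        raise ValueError("lam must be positive")
    # Work in log space and subtract the maximum for numerical stability.
    logits = [math.log(p) - c / lam if p > 0 else float("-inf")
              for c, p in zip(costs, ref_probs)]
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical numbers: three candidate actions.
costs = [1.0, 0.2, 0.5]        # task cost of each action (lower is better)
ref_probs = [0.7, 0.1, 0.2]    # data-driven reference policy

data_driven = blend_policy(costs, ref_probs, lam=100.0)  # large lam: stays near the reference
task_driven = blend_policy(costs, ref_probs, lam=0.01)   # small lam: concentrates on the cheapest action
```

A large `lam` keeps the blended policy close to the reference, while a small `lam` pushes nearly all probability onto the lowest-cost action.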

Introspective Planning: Guiding Language-Enabled Agents to Refine Their Own Uncertainty

Feb 18, 2024

Kaiqu Liang, Zixu Zhang, Jaime Fernández Fisac

Abstract: Large language models (LLMs) exhibit advanced reasoning skills, enabling robots to comprehend natural language instructions and strategically plan high-level actions through proper grounding. However, LLM hallucination may result in robots confidently executing plans that are misaligned with user goals or, in extreme cases, unsafe. Additionally, inherent ambiguity in natural language instructions can induce task uncertainty, particularly in situations where multiple valid options exist. To address this issue, LLMs must identify such uncertainty and proactively seek clarification. This paper explores the concept of introspective planning as a systematic method for guiding LLMs in forming uncertainty-aware plans for robotic task execution without the need for fine-tuning. We investigate uncertainty quantification in task-level robot planning and demonstrate that introspection significantly improves both success rates and safety compared to state-of-the-art LLM-based planning approaches. Furthermore, we assess the effectiveness of introspective planning in conjunction with conformal prediction, revealing that this combination yields tighter confidence bounds, thereby maintaining statistical success guarantees with fewer superfluous user clarification queries.

* 22 pages, 15 figures
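
To make the conformal prediction step mentioned in the abstract above more concrete, here is a minimal sketch of generic split conformal prediction over a set of candidate plans. It assumes each option comes with a confidence score, uses 1 minus the confidence of the correct option as the nonconformity score on a calibration set, and asks for clarification whenever the resulting prediction set does not contain exactly one option. The function names, calibration scores, and option confidences are hypothetical, and this is not the paper's introspective planning procedure.

```python
import math

def conformal_threshold(cal_scores, alpha):
    """Split-conformal quantile of nonconformity scores (1 - confidence in the true option)."""
    n = len(cal_scores)
    # 1-indexed rank of the finite-sample-corrected (1 - alpha) quantile.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")  # too few calibration points: keep every option
    return sorted(cal_scores)[rank - 1]

def prediction_set(option_confidences, qhat):
    """Keep every option whose nonconformity score does not exceed the threshold."""
    return [name for name, conf in option_confidences.items() if 1.0 - conf <= qhat]

# Hypothetical calibration scores from nine held-out tasks.
cal_scores = [0.05, 0.10, 0.15, 0.20, 0.30, 0.40, 0.50, 0.60, 0.80]
qhat = conformal_threshold(cal_scores, alpha=0.25)

# Hypothetical confidences for an ambiguous instruction.
options = {
    "pick up the red cup": 0.48,
    "pick up the blue cup": 0.47,
    "open the drawer": 0.05,
}
candidates = prediction_set(options, qhat)
needs_clarification = len(candidates) != 1
```

Sharper confidence estimates shrink the prediction sets, which is how fewer clarification queries can coexist with the same coverage level.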

Who Plays First? Optimizing the Order of Play in Stackelberg Games with Many Robots

Feb 14, 2024

Haimin Hu, Gabriele Dragotto, Zixu Zhang, Kaiqu Liang, Bartolomeo Stellato, Jaime F. Fisac

Abstract: We consider the multi-agent spatial navigation problem of computing the socially optimal order of play, i.e., the sequence in which the agents commit to their decisions, and its associated equilibrium in an N-player Stackelberg trajectory game. We model this problem as a mixed-integer optimization problem over the space of all possible Stackelberg games associated with the order of play's permutations. To solve the problem, we introduce Branch and Play (B&P), an efficient and exact algorithm that provably converges to a socially optimal order of play and its Stackelberg equilibrium. As a subroutine for B&P, we employ and extend sequential trajectory planning, i.e., a popular multi-agent control approach, to scalably compute valid local Stackelberg equilibria for any given order of play. We demonstrate the practical utility of B&P to coordinate air traffic control, swarm formation, and delivery vehicle fleets. We find that B&P consistently outperforms various baselines, and computes the socially optimal equilibrium.

Learning-Aware Safety for Interactive Autonomy

Sep 03, 2023

Haimin Hu, Zixu Zhang, Kensuke Nakamura, Andrea Bajcsy, Jaime F. Fisac

Abstract: One of the outstanding challenges for the widespread deployment of robotic systems like autonomous vehicles is ensuring safe interaction with humans without sacrificing efficiency. Existing safety analysis methods often neglect the robot's ability to learn and adapt at runtime, leading to overly conservative behavior. This paper proposes a new closed-loop paradigm for synthesizing safe control policies that explicitly account for the system's evolving uncertainty under possible future scenarios. The formulation reasons jointly about the physical dynamics and the robot's learning algorithm, which updates its internal belief over time. We leverage adversarial deep reinforcement learning (RL) for scaling to high dimensions, enabling tractable safety analysis even for implicit learning dynamics induced by state-of-the-art prediction models. We demonstrate our framework's ability to work with both Bayesian belief propagation and the implicit learning induced by a large pre-trained neural trajectory predictor.

* Conference on Robot Learning 2023

Segmentation of fundus vascular images based on a dual-attention mechanism

May 05, 2023

Yuanyuan Peng, Pengpeng Luan, Zixu Zhang

Abstract: Accurately segmenting blood vessels in retinal fundus images is crucial for the early screening, diagnosis, and evaluation of some ocular diseases. However, significant light variations and non-uniform contrast in these images make segmentation quite challenging. Thus, this paper employs an attention fusion mechanism that combines channel attention and spatial attention mechanisms constructed by Transformer to extract information from retinal fundus images in both spatial and channel dimensions. To eliminate noise from the encoder image, a spatial attention mechanism is introduced in the skip connection. Moreover, a Dropout layer is employed to randomly discard some neurons, which can prevent overfitting of the neural network and improve its generalization performance. Experiments were conducted on the publicly available datasets DERIVE, STARE, and CHASEDB1. The results demonstrate that our method produces satisfactory results compared to some recent retinal fundus image segmentation algorithms.

* 17 pages, 6 figures

Automatic segmentation of novel coronavirus pneumonia lesions in CT images utilizing deep-supervised ensemble learning network

Nov 17, 2021

Yuanyuan Peng, Zixu Zhang, Hongbin Tu, Xiong Li

Abstract: Background: The 2019 novel coronavirus disease (COVID-19) has spread widely around the world, posing a huge threat to people's living environment. Objective: Under computed tomography (CT) imaging, the structural features of COVID-19 lesions are complicated and vary greatly between cases. To accurately locate COVID-19 lesions and assist doctors in making the best diagnosis and treatment plan, a deep-supervised ensemble learning network is presented for COVID-19 lesion segmentation in CT images. Methods: Because a large number of COVID-19 CT images and the corresponding lesion annotations are difficult to obtain, a transfer learning strategy is employed to make up for this shortcoming and alleviate the overfitting problem. A traditional single deep learning framework struggles to extract COVID-19 lesion features effectively, which may cause some lesions to go undetected. To overcome this problem, a deep-supervised ensemble learning network is presented that combines local and global features for COVID-19 lesion segmentation. Results: The performance of the proposed method was validated in experiments with a publicly available dataset. Compared with manual annotations, the proposed method achieved a high intersection over union (IoU) of 0.7279. Conclusion: A deep-supervised ensemble learning network was presented for coronavirus pneumonia lesion segmentation in CT images. The effectiveness of the proposed method was verified by visual inspection and quantitative evaluation. Experimental results showed that the proposed method performs well in COVID-19 lesion segmentation.
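
For readers unfamiliar with the intersection over union (IoU) metric reported in the abstract above, the following minimal sketch computes it for two binary masks: the number of pixels marked as lesion in both masks divided by the number marked in either. The function name and the toy 3x3 masks are hypothetical and are not taken from the paper's dataset or evaluation code.

```python
def intersection_over_union(pred_mask, true_mask):
    """IoU between two binary masks given as equally sized nested lists of 0/1."""
    intersection = 0
    union = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            intersection += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if union == 0 else intersection / union

# Toy example: predicted versus manually annotated lesion pixels.
pred = [[0, 1, 1],
        [0, 1, 0],
        [0, 0, 0]]
true = [[0, 1, 0],
        [0, 1, 1],
        [0, 0, 0]]
iou = intersection_over_union(pred, true)  # 2 shared pixels out of 4 marked in either mask
```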

Safe Occlusion-aware Autonomous Driving via Game-Theoretic Active Perception

May 17, 2021

Zixu Zhang, Jaime F. Fisac

Abstract: Autonomous vehicles interacting with other traffic participants rely heavily on the perception and prediction of other agents' behaviors to plan safe trajectories. However, as occlusions limit the vehicle's perception ability, reasoning about potential hazards beyond the field of view is one of the most challenging issues in developing autonomous driving systems. This paper introduces a novel analytical approach that poses the problem of safe trajectory planning under occlusions as a hybrid zero-sum dynamic game between the autonomous vehicle (evader) and an initially hidden traffic participant (pursuer). Due to occlusions, the pursuer's state is initially unknown to the evader and may later be discovered by the vehicle's sensors. The analysis yields optimal strategies for both players as well as the set of initial conditions from which the autonomous vehicle is guaranteed to avoid collisions. We leverage this theoretical result to develop a novel trajectory planning framework for autonomous driving that provides worst-case safety guarantees while minimizing conservativeness by accounting for the vehicle's ability to actively avoid other road users as soon as they are detected in future observations. Our framework is agnostic to the driving environment and suitable for various motion planners. We demonstrate our algorithm on challenging urban and highway driving scenarios using the open-source CARLA simulator. The experimental results can be found at https://youtu.be/Cdm1T6Iv7GI.

* To appear in Robotics: Science and Systems (RSS), 2021

Pixel-Wise Motion Deblurring of Thermal Videos

Jun 08, 2020

Manikandasriram Srinivasan Ramanagopal, Zixu Zhang, Ram Vasudevan, Matthew Johnson-Roberson

Abstract: Uncooled microbolometers can enable robots to see in the absence of visible illumination by imaging the "heat" radiated from the scene. Despite this ability to see in the dark, these sensors suffer from significant motion blur, which has limited their application on robotic systems. As described in this paper, this motion blur arises from the thermal inertia of each pixel. As a result, traditional motion deblurring techniques, which rely on identifying an appropriate spatial blur kernel to perform spatial deconvolution, are unable to reliably deblur thermal camera images. To address this problem, this paper formulates reversing the effect of thermal inertia at a single pixel as a Least Absolute Shrinkage and Selection Operator (LASSO) problem, which we can solve rapidly using a quadratic programming solver. By leveraging sparsity and a high frame rate, this pixel-wise LASSO formulation is able to recover motion-deblurred frames of thermal videos without using any spatial information. To compare its quality against state-of-the-art visible-camera-based deblurring methods, this paper evaluates the performance of a family of pre-trained object detectors on a set of images restored by different deblurring algorithms. All evaluated object detectors performed systematically better on images restored by the proposed algorithm than on images restored by any other tested state-of-the-art method.

* 10 pages, 8 figures, Accepted to Robotics: Science and Systems 2020
