Yancheng Zhu

Enhancing Agent Safety Judgment: Controlled Benchmark Rewriting and Analogical Reasoning for Deceptive Out-of-Distribution Scenarios

May 05, 2026

Zuoyu Zhang, Yancheng Zhu

Abstract: Tool-using agent systems powered by large language models (LLMs) are increasingly deployed across web, app, operating-system, and transactional environments. Yet existing safety benchmarks still emphasize explicit risks, potentially overstating a model's ability to judge deceptive or ambiguous trajectories. To address this gap, we introduce ROME (Red-team Orchestrated Multi-agent Evolution), a controlled benchmark-construction pipeline that rewrites known unsafe trajectories into more deceptive evaluation instances while preserving their underlying risk labels. Starting from 100 unsafe source trajectories, ROME produces 300 challenge instances spanning contextual ambiguity, implicit risks, and shortcut decision-making. Experiments show that these challenge sets substantially degrade safety-judgment performance, with hidden-risk cases remaining particularly non-trivial even for recent frontier models. We further study ARISE (Analogical Reasoning for Inference-time Safety Enhancement), a retrieval-guided inference-time enhancement that retrieves ReAct-style analogical safety trajectories from an external analogical base and injects them as structured reasoning exemplars. ARISE improves judgment quality without retraining, but is best viewed as a task-specific robustness enhancement rather than a standalone safety guarantee. Together, ROME and ARISE provide practical tools for stress-testing and improving agent safety judgment under deceptive distribution shifts.

The Battling Influencers Game: Nash Equilibria Structure of a Potential Game and Implications to Value Alignment

Feb 03, 2025

Young Wu, Yancheng Zhu, Jin-Yi Cai, Xiaojin Zhu

Abstract: When multiple influencers compete for a receiver's attention, their influencing strategies must account for the presence of one another. We introduce the Battling Influencers Game (BIG), a multi-player simultaneous-move general-sum game, to provide a game-theoretic characterization of this social phenomenon. We prove that BIG is a potential game, that it has either one or an infinite number of pure Nash equilibria (NEs), and that these pure NEs can be found by convex optimization. Interestingly, we also prove that at any pure NE, all influencers (except at most one) must exaggerate their actions to the maximum extent. In other words, it is rational for the influencers to be non-truthful and extreme because they anticipate that other influencers will cancel out part of their influence. We discuss the implications of BIG for value alignment.

* 9 pages, 8 figures, submitted to ICML

MRI Field-transfer Reconstruction with Limited Data: Regularization by Neural Style Transfer

Aug 21, 2023

Guoyao Shen, Yancheng Zhu, Hernan Jara, Sean B. Andersson, Chad W. Farris, Stephan Anderson, Xin Zhang

Abstract: Recent works have demonstrated success in MRI reconstruction using deep learning-based models. However, most reported approaches require training on a task-specific, large-scale dataset. Regularization by denoising (RED) is a general pipeline that embeds a denoiser as a prior for image reconstruction. The potential of RED has been demonstrated for multiple image-related tasks such as denoising, deblurring, and super-resolution. In this work, we propose a regularization by neural style transfer (RNST) method to further leverage the priors from the neural style transfer and denoising engines. This enables RNST to reconstruct a high-quality image from a noisy low-quality image with different image styles and limited data. We validate RNST with clinical MRI scans from 1.5T and 3T and show that RNST can significantly boost image quality. Our results highlight the capability of the RNST framework for MRI reconstruction and the potential for reconstruction tasks with limited data.

* 30 pages, 8 figures, 2 tables, 1 algorithm chart
