Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Guangyao Shi

University of Southern California

Act on What You See: Unlocking Safe Social Navigation in Vision-Language-Action Models

Jun 09, 2026

Qingzi Wang, Xiyang Wu, Guangyao Shi, Dianwei Chen, Xianfeng Yang, Dinesh Manocha

Abstract:Safe social navigation requires robots to distinguish people from ordinary obstacles and to react before danger becomes imminent. We show that pretrained Vision-Language-Action (VLA) models already encode pedestrian-object distinctions and future collision signals in their internal representations, but behavior cloning fails to translate these signals into socially appropriate actions. To address this mismatch, we propose SALSA, a two-stage annotation-free post-training framework: (1) social behavioral alignment bridges intermediate-layer social features to the action head and trains on counterfactual human-object scene pairs to break visual saliency shortcuts; (2) temporal safety alignment provides automatically generated future-risk supervision to enable anticipatory collision avoidance. On SCAND and real-world deployment, SALSA reduces near-collisions by 86.4% and improves social counterfactual accuracy from 53% to 93%, demonstrating that safer social navigation can be achieved by teaching VLA policies to act on representations they already possess. These results show that pretrained VLA policies can be adapted for safer social navigation by better aligning their latent representations with action generation.

Via

Access Paper or Ask Questions

A Synergistic CNN-Transformer Network with Pooling Attention Fusion for Hyperspectral Image Classification

Apr 26, 2026

Peng Chen, Wenxuan He, Feng Qian, Guangyao Shi, Jingwen Yan

Abstract:In the hyperspectral image (HSI) classification task, each pixel is categorized into a specific land-cover category or material. Convolutional neural networks (CNNs) and transformers have been widely used to extract local and non-local features in HSI classification. Recent works have utilized a multi-scale vision transformer (ViT) to enhance spectral feature capture and yield promising results. However, most existing methods still face challenges in the effective joint use of spatial-spectral information and in preserving information across layers during the propagation process. To address these issues, we propose a synergistic CNN-Transformer network with pooling attention fusion for HSI classification, which collaboratively utilizes CNNs and ViT to process spatial and spectral features separately. Specifically, we propose a Twin-Branch Feature Extraction (TBFE) module, which employs 3D and 2D convolution in parallel to comprehensively extract spectral and spatial features from HSI. A hybrid pooling attention (HPA) module is designed to aggregate spatial attention. Moreover, a cascade transformer encoder is employed for global spectral feature extraction, and a simple yet efficient cross-layer feature fusion (CFF) module is designed to reduce the loss of crucial information in the previous network layers. Extensive experiments are conducted on several representative datasets to demonstrate the superior performance of our proposed method compared to the state-of-the-art works. Code is available at https://github.com/chenpeng052/SCT-Net.git.

Via

Access Paper or Ask Questions

Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks

Apr 22, 2026

Xiyang Wu, Zongxia Li, Guangyao Shi, Alexander Duffy, Tyler Marques, Matthew Lyle Olson, Tianyi Zhou, Dinesh Manocha

Abstract:Long horizon interactive environments are a testbed for evaluating agents skill usage abilities. These environments demand multi step reasoning, the chaining of multiple skills over many timesteps, and robust decision making under delayed rewards and partial observability. Games are a good testbed for evaluating agent skill usage in environments. Large Language Models (LLMs) offer a promising alternative as game playing agents, but they often struggle with consistent long horizon decision making because they lack a mechanism to discover, retain, and reuse structured skills across episodes. We present COSPLAY, a co evolution framework in which an LLM decision agent retrieves skills from a learnable skill bank to guide action taking, while an agent managed skill pipeline discovers reusable skills from the agents unlabeled rollouts to form a skill bank. Our framework improves both the decision agent to learn better skill retrieval and action generation, while the skill bank agent continually extracts, refines, and updates skills together with their contracts. Experiments across six game environments show that COSPLAY with an 8B base model achieves over 25.1 percent average reward improvement against four frontier LLM baselines on single player game benchmarks while remaining competitive on multi player social reasoning games.

* 26 pages

Via

Access Paper or Ask Questions

SABER: A Stealthy Agentic Black-Box Attack Framework for Vision-Language-Action Models

Mar 26, 2026

Xiyang Wu, Guangyao Shi, Qingzi Wang, Zongxia Li, Amrit Singh Bedi, Dinesh Manocha

Abstract:Vision-language-action (VLA) models enable robots to follow natural-language instructions grounded in visual observations, but the instruction channel also introduces a critical vulnerability: small textual perturbations can alter downstream robot behavior. Systematic robustness evaluation therefore requires a black-box attacker that can generate minimal yet effective instruction edits across diverse VLA models. To this end, we present SABER, an agent-centric approach for automatically generating instruction-based adversarial attacks on VLA models under bounded edit budgets. SABER uses a GRPO-trained ReAct attacker to generate small, plausible adversarial instruction edits using character-, token-, and prompt-level tools under a bounded edit budget that induces targeted behavioral degradation, including task failure, unnecessarily long execution, and increased constraint violations. On the LIBERO benchmark across six state-of-the-art VLA models, SABER reduces task success by 20.6%, increases action-sequence length by 55%, and raises constraint violations by 33%, while requiring 21.1% fewer tool calls and 54.7% fewer character edits than strong GPT-based baselines. These results show that small, plausible instruction edits are sufficient to substantially degrade robot execution, and that an agentic black-box pipeline offers a practical, scalable, and adaptive approach for red-teaming robotic foundation models.

Via

Access Paper or Ask Questions

First Frame Is the Place to Go for Video Content Customization

Nov 19, 2025

Jingxi Chen, Zongxia Li, Zhichao Liu, Guangyao Shi, Xiyang Wu, Fuxiao Liu, Cornelia Fermuller, Brandon Y. Feng, Yiannis Aloimonos

Abstract:What role does the first frame play in video generation models? Traditionally, it's viewed as the spatial-temporal starting point of a video, merely a seed for subsequent animation. In this work, we reveal a fundamentally different perspective: video models implicitly treat the first frame as a conceptual memory buffer that stores visual entities for later reuse during generation. Leveraging this insight, we show that it's possible to achieve robust and generalized video content customization in diverse scenarios, using only 20-50 training examples without architectural changes or large-scale finetuning. This unveils a powerful, overlooked capability of video generation models for reference-based video customization.

* Project Website: https://firstframego.github.io/

Via

Access Paper or Ask Questions

PIP-LLM: Integrating PDDL-Integer Programming with LLMs for Coordinating Multi-Robot Teams Using Natural Language

Oct 26, 2025

Guangyao Shi, Yuwei Wu, Vijay Kumar, Gaurav S. Sukhatme

Abstract:Enabling robot teams to execute natural language commands requires translating high-level instructions into feasible, efficient multi-robot plans. While Large Language Models (LLMs) combined with Planning Domain Description Language (PDDL) offer promise for single-robot scenarios, existing approaches struggle with multi-robot coordination due to brittle task decomposition, poor scalability, and low coordination efficiency. We introduce PIP-LLM, a language-based coordination framework that consists of PDDL-based team-level planning and Integer Programming (IP) based robot-level planning. PIP-LLMs first decomposes the command by translating the command into a team-level PDDL problem and solves it to obtain a team-level plan, abstracting away robot assignment. Each team-level action represents a subtask to be finished by the team. Next, this plan is translated into a dependency graph representing the subtasks' dependency structure. Such a dependency graph is then used to guide the robot-level planning, in which each subtask node will be formulated as an IP-based task allocation problem, explicitly optimizing travel costs and workload while respecting robot capabilities and user-defined constraints. This separation of planning from assignment allows PIP-LLM to avoid the pitfalls of syntax-based decomposition and scale to larger teams. Experiments across diverse tasks show that PIP-LLM improves plan success rate, reduces maximum and average travel costs, and achieves better load balancing compared to state-of-the-art baselines.

Via

Access Paper or Ask Questions

Multi-Agent Trust Region Policy Optimisation: A Joint Constraint Approach

Aug 14, 2025

Chak Lam Shek, Guangyao Shi, Pratap Tokekar

Abstract:Multi-agent reinforcement learning (MARL) requires coordinated and stable policy updates among interacting agents. Heterogeneous-Agent Trust Region Policy Optimization (HATRPO) enforces per-agent trust region constraints using Kullback-Leibler (KL) divergence to stabilize training. However, assigning each agent the same KL threshold can lead to slow and locally optimal updates, especially in heterogeneous settings. To address this limitation, we propose two approaches for allocating the KL divergence threshold across agents: HATRPO-W, a Karush-Kuhn-Tucker-based (KKT-based) method that optimizes threshold assignment under global KL constraints, and HATRPO-G, a greedy algorithm that prioritizes agents based on improvement-to-divergence ratio. By connecting sequential policy optimization with constrained threshold scheduling, our approach enables more flexible and effective learning in heterogeneous-agent settings. Experimental results demonstrate that our methods significantly boost the performance of HATRPO, achieving faster convergence and higher final rewards across diverse MARL benchmarks. Specifically, HATRPO-W and HATRPO-G achieve comparable improvements in final performance, each exceeding 22.5%. Notably, HATRPO-W also demonstrates more stable learning dynamics, as reflected by its lower variance.

Via

Access Paper or Ask Questions

VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations for Synthetic Videos

May 02, 2025

Zongxia Li, Xiyang Wu, Yubin Qin, Guangyao Shi, Hongyang Du, Dinesh Manocha, Tianyi Zhou, Jordan Lee Boyd-Graber

Abstract:Synthetic video generation with foundation models has gained attention for its realism and wide applications. While these models produce high-quality frames, they often fail to respect common sense and physical laws, resulting in abnormal content. Existing metrics like VideoScore emphasize general quality but ignore such violations and lack interpretability. A more insightful approach is using multi-modal large language models (MLLMs) as interpretable evaluators, as seen in FactScore. Yet, MLLMs' ability to detect abnormalities in synthetic videos remains underexplored. To address this, we introduce VideoHallu, a benchmark featuring synthetic videos from models like Veo2, Sora, and Kling, paired with expert-designed QA tasks solvable via human-level reasoning across various categories. We assess several SoTA MLLMs, including GPT-4o, Gemini-2.5-Pro, Qwen-2.5-VL, and newer models like Video-R1 and VideoChat-R1. Despite strong real-world performance on MVBench and MovieChat, these models still hallucinate on basic commonsense and physics tasks in synthetic settings, underscoring the challenge of hallucination. We further fine-tune SoTA MLLMs using Group Relative Policy Optimization (GRPO) on real and synthetic commonsense/physics data. Results show notable accuracy gains, especially with counterexample integration, advancing MLLMs' reasoning capabilities. Our data is available at https://github.com/zli12321/VideoHallu.

Via

Access Paper or Ask Questions

Benchmark Evaluations, Applications, and Challenges of Large Vision Language Models: A Survey

Jan 04, 2025

Zongxia Li, Xiyang Wu, Hongyang Du, Huy Nghiem, Guangyao Shi

Figure 1 for Benchmark Evaluations, Applications, and Challenges of Large Vision Language Models: A Survey

Figure 2 for Benchmark Evaluations, Applications, and Challenges of Large Vision Language Models: A Survey

Figure 3 for Benchmark Evaluations, Applications, and Challenges of Large Vision Language Models: A Survey

Figure 4 for Benchmark Evaluations, Applications, and Challenges of Large Vision Language Models: A Survey

Abstract:Multimodal Vision Language Models (VLMs) have emerged as a transformative technology at the intersection of computer vision and natural language processing, enabling machines to perceive and reason about the world through both visual and textual modalities. For example, models such as CLIP, Claude, and GPT-4V demonstrate strong reasoning and understanding abilities on visual and textual data and beat classical single modality vision models on zero-shot classification. Despite their rapid advancements in research and growing popularity in applications, a comprehensive survey of existing studies on VLMs is notably lacking, particularly for researchers aiming to leverage VLMs in their specific domains. To this end, we provide a systematic overview of VLMs in the following aspects: model information of the major VLMs developed over the past five years (2019-2024); the main architectures and training methods of these VLMs; summary and categorization of the popular benchmarks and evaluation metrics of VLMs; the applications of VLMs including embodied agents, robotics, and video generation; the challenges and issues faced by current VLMs such as hallucination, fairness, and safety. Detailed collections including papers and model repository links are listed in https://github.com/zli12321/Awesome-VLM-Papers-And-Models.git.

* 34 pages, 3 figures

Via

Access Paper or Ask Questions

Inverse Risk-sensitive Multi-Robot Task Allocation

Jun 14, 2024

Guangyao Shi, Gaurav S. Sukhatme

Figure 1 for Inverse Risk-sensitive Multi-Robot Task Allocation

Figure 2 for Inverse Risk-sensitive Multi-Robot Task Allocation

Figure 3 for Inverse Risk-sensitive Multi-Robot Task Allocation

Abstract:We consider a new variant of the multi-robot task allocation problem - Inverse Risk-sensitive Multi-Robot Task Allocation (IR-MRTA). "Forward" MRTA - the process of deciding which robot should perform a task given the reward (cost)-related parameters, is widely studied in the multi-robot literature. In this setting, the reward (cost)-related parameters are assumed to be already known: parameters are first fixed offline by domain experts, followed by coordinating robots online. What if we need these parameters to be adjusted by non-expert human supervisors who oversee the robots during tasks to adapt to new situations? We are interested in the case where the human supervisor's perception of the allocation risk may change and suggest different allocations for robots compared to that from the MRTA algorithm. In such cases, the robots need to change the parameters of the allocation problem based on evolving human preferences. We study such problems through the lens of inverse task allocation, i.e., the process of finding parameters given solutions to the problem. Specifically, we propose a new formulation IR-MRTA, in which we aim to find a new set of parameters of the human behavioral risk model that minimally deviates from the current MRTA parameters and can make a greedy task allocation algorithm allocate robot resources in line with those suggested by humans. We show that even in the simple case such a problem is a non-convex optimization problem. We propose a Branch $\&$ Bound algorithm (BB-IR-MRTA) to solve such problems. In numerical simulations of a case study on multi-robot target capture, we demonstrate how to use BB-IR-MRTA and we show that the proposed algorithm achieves significant advantages in running time and peak memory usage compared to a brute-force baseline.

Via

Access Paper or Ask Questions