Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yuhang Wu

UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding

Jul 30, 2025

Shuquan Lian, Yuhang Wu, Jia Ma, Zihan Song, Bingqi Chen, Xiawu Zheng, Hui Li

Abstract:The emergence of Multimodal Large Language Models (MLLMs) has driven significant advances in Graphical User Interface (GUI) agent capabilities. Nevertheless, existing GUI agent training and inference techniques still suffer from a dilemma for reasoning designs, ineffective reward, and visual noise. To address these issues, we introduce UI-AGILE, a comprehensive framework enhancing GUI agents at both the training and inference stages. For training, we propose a suite of improvements to the Supervised Fine-Tuning (SFT) process: 1) a Continuous Reward function to incentivize high-precision grounding; 2) a "Simple Thinking" reward to balance planning with speed and grounding accuracy; and 3) a Cropping-based Resampling strategy to mitigate the sparse reward problem and improve learning on complex tasks. For inference, we present Decomposed Grounding with Selection, a novel method that dramatically improves grounding accuracy on high-resolution displays by breaking the image into smaller, manageable parts. Experiments show that UI-AGILE achieves the state-of-the-art performance on two benchmarks ScreenSpot-Pro and ScreenSpot-v2. For instance, using both our proposed training and inference enhancement methods brings 23% grounding accuracy improvement over the best baseline on ScreenSpot-Pro.

Via

Access Paper or Ask Questions

Joint Resource Optimization Over Licensed and Unlicensed Spectrum in Spectrum Sharing UAV Networks Against Jamming Attacks

Jul 23, 2025

Rui Ding, Fuhui Zhou, Yuhang Wu, Qihui Wu, Tony Q. S. Quek

Abstract:Unmanned aerial vehicle (UAV) communication is of crucial importance in realizing heterogeneous practical wireless application scenarios. However, the densely populated users and diverse services with high data rate demands has triggered an increasing scarcity of UAV spectrum utilization. To tackle this problem, it is promising to incorporate the underutilized unlicensed spectrum with the licensed spectrum to boost network capacity. However, the openness of unlicensed spectrum makes UAVs susceptible to security threats from potential jammers. Therefore, a spectrum sharing UAV network coexisting with licensed cellular network and unlicensed Wi-Fi network is considered with the anti-jamming technique in this paper. The sum rate maximization of the secondary network is studied by jointly optimizing the transmit power, subchannel allocation, and UAV trajectory. We first decompose the challenging non-convex problem into two subproblems, 1) the joint power and subchannel allocation and 2) UAV trajectory design subproblems. A low-complexity iterative algorithm is proposed in a alternating optimization manner over these two subproblems to solve the formulated problem. Specifically, the Lagrange dual decomposition is exploited to jointly optimize the transmit power and subchannel allocation iteratively. Then, an efficient iterative algorithm capitalizing on successive convex approximation is designed to get a suboptimal solution for UAV trajectory. Simulation results demonstrate that our proposed algorithm can significantly improve the sum transmission rate compared with the benchmark schemes.

Via

Access Paper or Ask Questions

Benchmarking Abstract and Reasoning Abilities Through A Theoretical Perspective

May 28, 2025

Qingchuan Ma, Yuhang Wu, Xiawu Zheng, Rongrong Ji

Figure 1 for Benchmarking Abstract and Reasoning Abilities Through A Theoretical Perspective

Figure 2 for Benchmarking Abstract and Reasoning Abilities Through A Theoretical Perspective

Figure 3 for Benchmarking Abstract and Reasoning Abilities Through A Theoretical Perspective

Figure 4 for Benchmarking Abstract and Reasoning Abilities Through A Theoretical Perspective

Abstract:In this paper, we aim to establish a simple, effective, and theoretically grounded benchmark for rigorously probing abstract reasoning in Large Language Models (LLMs). To achieve this, we first develop a mathematic framework that defines abstract reasoning as the ability to: (i) extract essential patterns independent of surface representations, and (ii) apply consistent rules to these abstract patterns. Based on this framework, we introduce two novel complementary metrics: $\scoreGamma$ measures basic reasoning accuracy, while $\scoreDelta$ quantifies a model's reliance on specific symbols rather than underlying patterns - a key indicator of true abstraction versus mere memorization. To implement this measurement, we design a benchmark: systematic symbol remapping in rule-based tasks, which forces models to demonstrate genuine pattern recognition beyond superficial token matching. Extensive LLM evaluations using this benchmark (commercial API models, 7B-70B, multi-agent) reveal:1) critical limitations in non-decimal arithmetic and symbolic reasoning; 2) persistent abstraction gaps despite chain-of-thought prompting; and 3) $\scoreDelta$'s effectiveness in robustly measuring memory dependence by quantifying performance degradation under symbol remapping, particularly highlighting operand-specific memorization. These findings underscore that current LLMs, despite domain-specific strengths, still lack robust abstract reasoning, highlighting key areas for future improvement.

Via

Access Paper or Ask Questions

Improving LLM Interpretability and Performance via Guided Embedding Refinement for Sequential Recommendation

Apr 15, 2025

Nanshan Jia, Chenfei Yuan, Yuhang Wu, Zeyu Zheng

Figure 1 for Improving LLM Interpretability and Performance via Guided Embedding Refinement for Sequential Recommendation

Figure 2 for Improving LLM Interpretability and Performance via Guided Embedding Refinement for Sequential Recommendation

Figure 3 for Improving LLM Interpretability and Performance via Guided Embedding Refinement for Sequential Recommendation

Figure 4 for Improving LLM Interpretability and Performance via Guided Embedding Refinement for Sequential Recommendation

Abstract:The fast development of Large Language Models (LLMs) offers growing opportunities to further improve sequential recommendation systems. Yet for some practitioners, integrating LLMs to their existing base recommendation systems raises questions about model interpretability, transparency and related safety. To partly alleviate challenges from these questions, we propose guided embedding refinement, a method that carries out a guided and interpretable usage of LLM to enhance the embeddings associated with the base recommendation system. Instead of directly using LLMs as the backbone of sequential recommendation systems, we utilize them as auxiliary tools to emulate the sales logic of recommendation and generate guided embeddings that capture domain-relevant semantic information on interpretable attributes. Benefiting from the strong generalization capabilities of the guided embedding, we construct refined embedding by using the guided embedding and reduced-dimension version of the base embedding. We then integrate the refined embedding into the recommendation module for training and inference. A range of numerical experiments demonstrate that guided embedding is adaptable to various given existing base embedding models, and generalizes well across different recommendation tasks. The numerical results show that the refined embedding not only improves recommendation performance, achieving approximately $10\%$ to $50\%$ gains in Mean Reciprocal Rank (MRR), Recall rate, and Normalized Discounted Cumulative Gain (NDCG), but also enhances interpretability, as evidenced by case studies.

Via

Access Paper or Ask Questions

Uncertainty Quantification for LLM-Based Survey Simulations

Feb 25, 2025

Chengpiao Huang, Yuhang Wu, Kaizheng Wang

Figure 1 for Uncertainty Quantification for LLM-Based Survey Simulations

Figure 2 for Uncertainty Quantification for LLM-Based Survey Simulations

Figure 3 for Uncertainty Quantification for LLM-Based Survey Simulations

Figure 4 for Uncertainty Quantification for LLM-Based Survey Simulations

Abstract:We investigate the reliable use of simulated survey responses from large language models (LLMs) through the lens of uncertainty quantification. Our approach converts synthetic data into confidence sets for population parameters of human responses, addressing the distribution shift between the simulated and real populations. A key innovation lies in determining the optimal number of simulated responses: too many produce overly narrow confidence sets with poor coverage, while too few yield excessively loose estimates. To resolve this, our method adaptively selects the simulation sample size, ensuring valid average-case coverage guarantees. It is broadly applicable to any LLM, irrespective of its fidelity, and any procedure for constructing confidence sets. Additionally, the selected sample size quantifies the degree of misalignment between the LLM and the target human population. We illustrate our method on real datasets and LLMs.

* 30 pages, 6 figures, 10 tables

Via

Access Paper or Ask Questions

Spend Wisely: Maximizing Post-Training Gains in Iterative Synthetic Data Boostrapping

Jan 31, 2025

Pu Yang, Yunzhen Feng, Ziyuan Chen, Yuhang Wu, Zhuoyuan Li

Figure 1 for Spend Wisely: Maximizing Post-Training Gains in Iterative Synthetic Data Boostrapping

Figure 2 for Spend Wisely: Maximizing Post-Training Gains in Iterative Synthetic Data Boostrapping

Figure 3 for Spend Wisely: Maximizing Post-Training Gains in Iterative Synthetic Data Boostrapping

Figure 4 for Spend Wisely: Maximizing Post-Training Gains in Iterative Synthetic Data Boostrapping

Abstract:Modern foundation models often undergo iterative ``bootstrapping'' in their post-training phase: a model generates synthetic data, an external verifier filters out low-quality samples, and the high-quality subset is used for further fine-tuning. Over multiple iterations, the model's performance improves--raising a crucial question: how should the total budget on generation and training be allocated across iterations to maximize final performance? In this work, we develop a theoretical framework to analyze budget allocation strategies. Specifically, we show that constant policies fail to converge with high probability, while increasing policies--particularly exponential growth policies--exhibit significant theoretical advantages. Experiments on image denoising with diffusion probabilistic models and math reasoning with large language models show that both exponential and polynomial growth policies consistently outperform constant policies, with exponential policies often providing more stable performance.

Via

Access Paper or Ask Questions

Black-box Optimization with Simultaneous Statistical Inference for Optimal Performance

Jan 14, 2025

Teng Lian, Jian-Qiang Hu, Yuhang Wu, Zeyu Zheng

Abstract:Black-box optimization is often encountered for decision-making in complex systems management, where the knowledge of system is limited. Under these circumstances, it is essential to balance the utilization of new information with computational efficiency. In practice, decision-makers often face the dual tasks of optimization and statistical inference for the optimal performance, in order to achieve it with a high reliability. Our goal is to address the dual tasks in an online fashion. Wu et al (2022) [arXiv preprint: 2210.06737] point out that the sample average of performance estimates generated by the optimization algorithm needs not to admit a central limit theorem. We propose an algorithm that not only tackles this issue, but also provides an online consistent estimator for the variance of the performance. Furthermore, we characterize the convergence rate of the coverage probabilities of the asymptotic confidence intervals.

Via

Access Paper or Ask Questions

Grasp What You Want: Embodied Dexterous Grasping System Driven by Your Voice

Dec 14, 2024

Junliang Li, Kai Ye, Haolan Kang, Mingxuan Liang, Yuhang Wu, Zhenhua Liu, Huiping Zhuang, Rui Huang, Yongquan Chen

Abstract:In recent years, as robotics has advanced, human-robot collaboration has gained increasing importance. However, current robots struggle to fully and accurately interpret human intentions from voice commands alone. Traditional gripper and suction systems often fail to interact naturally with humans, lack advanced manipulation capabilities, and are not adaptable to diverse tasks, especially in unstructured environments. This paper introduces the Embodied Dexterous Grasping System (EDGS), designed to tackle object grasping in cluttered environments for human-robot interaction. We propose a novel approach to semantic-object alignment using a Vision-Language Model (VLM) that fuses voice commands and visual information, significantly enhancing the alignment of multi-dimensional attributes of target objects in complex scenarios. Inspired by human hand-object interactions, we develop a robust, precise, and efficient grasping strategy, incorporating principles like the thumb-object axis, multi-finger wrapping, and fingertip interaction with an object's contact mechanics. We also design experiments to assess Referring Expression Representation Enrichment (RERE) in referring expression segmentation, demonstrating that our system accurately detects and matches referring expressions. Extensive experiments confirm that EDGS can effectively handle complex grasping tasks, achieving stability and high success rates, highlighting its potential for further development in the field of Embodied AI.

Via

Access Paper or Ask Questions

UTF:Undertrained Tokens as Fingerprints A Novel Approach to LLM Identification

Oct 16, 2024

Jiacheng Cai, Jiahao Yu, Yangguang Shao, Yuhang Wu, Xinyu Xing

Figure 1 for UTF:Undertrained Tokens as Fingerprints A Novel Approach to LLM Identification

Figure 2 for UTF:Undertrained Tokens as Fingerprints A Novel Approach to LLM Identification

Figure 3 for UTF:Undertrained Tokens as Fingerprints A Novel Approach to LLM Identification

Figure 4 for UTF:Undertrained Tokens as Fingerprints A Novel Approach to LLM Identification

Abstract:Fingerprinting large language models (LLMs) is essential for verifying model ownership, ensuring authenticity, and preventing misuse. Traditional fingerprinting methods often require significant computational overhead or white-box verification access. In this paper, we introduce UTF, a novel and efficient approach to fingerprinting LLMs by leveraging under-trained tokens. Under-trained tokens are tokens that the model has not fully learned during its training phase. By utilizing these tokens, we perform supervised fine-tuning to embed specific input-output pairs into the model. This process allows the LLM to produce predetermined outputs when presented with certain inputs, effectively embedding a unique fingerprint. Our method has minimal overhead and impact on model's performance, and does not require white-box access to target model's ownership identification. Compared to existing fingerprinting methods, UTF is also more effective and robust to fine-tuning and random guess.

Via

Access Paper or Ask Questions

Can LLMs "Reason" in Music? An Evaluation of LLMs' Capability of Music Understanding and Generation

Jul 31, 2024

Ziya Zhou, Yuhang Wu, Zhiyue Wu, Xinyue Zhang, Ruibin Yuan, Yinghao Ma, Lu Wang, Emmanouil Benetos, Wei Xue, Yike Guo

Figure 1 for Can LLMs "Reason" in Music? An Evaluation of LLMs' Capability of Music Understanding and Generation

Figure 2 for Can LLMs "Reason" in Music? An Evaluation of LLMs' Capability of Music Understanding and Generation

Figure 3 for Can LLMs "Reason" in Music? An Evaluation of LLMs' Capability of Music Understanding and Generation

Figure 4 for Can LLMs "Reason" in Music? An Evaluation of LLMs' Capability of Music Understanding and Generation

Abstract:Symbolic Music, akin to language, can be encoded in discrete symbols. Recent research has extended the application of large language models (LLMs) such as GPT-4 and Llama2 to the symbolic music domain including understanding and generation. Yet scant research explores the details of how these LLMs perform on advanced music understanding and conditioned generation, especially from the multi-step reasoning perspective, which is a critical aspect in the conditioned, editable, and interactive human-computer co-creation process. This study conducts a thorough investigation of LLMs' capability and limitations in symbolic music processing. We identify that current LLMs exhibit poor performance in song-level multi-step music reasoning, and typically fail to leverage learned music knowledge when addressing complex musical tasks. An analysis of LLMs' responses highlights distinctly their pros and cons. Our findings suggest achieving advanced musical capability is not intrinsically obtained by LLMs, and future research should focus more on bridging the gap between music knowledge and reasoning, to improve the co-creation experience for musicians.

* Accepted by ISMIR2024

Via

Access Paper or Ask Questions