Xuanqing Liu

Lean4Agent: Formal Modeling and Verification for Agent Workflow and Trajectory

Jun 02, 2026

Ruida Wang, Jerry Huang, Pengcheng Wang, Xuanqing Liu, Luyang Kong, Tong Zhang

Abstract: Equipping Large Language Models (LLMs) to execute reliable multi-step workflows has become a central challenge in artificial intelligence. Despite recent advances in LLMs' agentic capabilities, most agent systems still lack formal methods for specifying, verifying, and debugging their workflows and execution trajectories. This challenge mirrors a long-standing problem in mathematics, where the ambiguity of natural languages (NLs) motivated the development of formal languages (FLs). Inspired by this paradigm, we propose **Lean4Agent**, to the best of our knowledge the first framework that uses Lean4, a dependently typed FL, to model and verify agent behavior. **Lean4Agent** introduces **FormalAgentLib**, an extensible Lean4 library for formally modeling and verifying the semantic consistency of agent workflows under explicit assumptions, and for localizing execution-time failures revealed by trajectories. Building on **FormalAgentLib**, we further develop **LeanEvolve**, which applies results from **FormalAgentLib** to revise workflows and enhance their capability. Extensive experiments on a hard problem subset of SWE-Bench-Verified and a subset of ELAIP-Bench across 5 leading LLMs indicate that the verification-passing workflows outperform the failing ones by an average of **11.94%**, and **LeanEvolve** further improves SWE performance by **7.47%** on average. Furthermore, **Lean4Agent** establishes a foundation for a new field of using an expressive dependently typed FL to formally model and verify agent behavior.

SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

Feb 13, 2026

Xiangyi Li, Wenbo Chen, Yimin Liu, Shenghan Zheng, Xiaokun Chen, Yifeng He, Yubo Li, Bingran You, Haotian Shen, Jiankai Sun(+30 more)

Abstract: Agent Skills are structured packages of procedural knowledge that augment LLM agents at inference time. Despite rapid adoption, there is no standard way to measure whether they actually help. We present SkillsBench, a benchmark of 86 tasks across 11 domains paired with curated Skills and deterministic verifiers. Each task is evaluated under three conditions: no Skills, curated Skills, and self-generated Skills. We test 7 agent-model configurations over 7,308 trajectories. Curated Skills raise average pass rate by 16.2 percentage points (pp), but effects vary widely by domain (+4.5pp for Software Engineering to +51.9pp for Healthcare) and 16 of 84 tasks show negative deltas. Self-generated Skills provide no benefit on average, showing that models cannot reliably author the procedural knowledge they benefit from consuming. Focused Skills with 2-3 modules outperform comprehensive documentation, and smaller models with Skills can match larger models without them.

Learning LLM Preference over Intra-Dialogue Pairs: A Framework for Utterance-level Understandings

Mar 07, 2025

Xuanqing Liu, Luyang Kong, Wei Niu, Afshin Khashei, Belinda Zeng, Steve Johnson, Jon Jay, Davor Golac, Matt Pope

Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in handling complex dialogue tasks without requiring use case-specific fine-tuning. However, analyzing live dialogues in real time necessitates low-latency processing systems, making it impractical to deploy models with billions of parameters. As a result, practitioners often prefer smaller models with millions of parameters, trained on high-quality, human-annotated datasets. Yet, curating such datasets is both time-consuming and costly. Consequently, there is a growing need to combine the scalability of LLM-generated labels with the precision of human annotations, enabling fine-tuned smaller models to achieve both higher speed and accuracy comparable to larger models. In this paper, we introduce a simple yet effective framework to address this challenge. Our approach is specifically designed for per-utterance classification problems, which encompass tasks such as intent detection, dialogue state tracking, and more. To mitigate the impact of labeling errors from LLMs, the primary source of inaccuracies in student models, we propose a noise-reduced preference learning loss. Experimental results demonstrate that our method significantly improves accuracy across utterance-level dialogue tasks, including sentiment detection (over $2\%$) and dialogue act classification (over $1.5\%$).

* 7 pages, 4 figures

GRAM: Generative Retrieval Augmented Matching of Data Schemas in the Context of Data Security

Jun 04, 2024

Xuanqing Liu, Luyang Kong, Runhui Wang, Patrick Song, Austin Nevins, Henrik Johnson, Nimish Amlathe, Davor Golac

Abstract: Schema matching constitutes a pivotal phase in the data ingestion process for contemporary database systems. Its objective is to discern pairwise similarities between two sets of attributes, each associated with a distinct data table. This challenge emerges at the initial stages of data analytics, such as when incorporating a third-party table into existing databases to inform business insights. Given its significance in the realm of database systems, schema matching has been under investigation since the 2000s. This study revisits this foundational problem within the context of large language models. Adhering to increasingly stringent data security policies, our focus lies on the zero-shot and few-shot scenarios: the model should analyze only a minimal amount of customer data to execute the matching task, contrasting with the conventional approach of scrutinizing the entire data table. We emphasize that the zero-shot or few-shot assumption is imperative to safeguard the identity and privacy of customer data, even at the potential cost of accuracy. The capability to accurately match attributes under such stringent requirements distinguishes our work from previous literature in this domain.

* KDD 2024 Camera Ready; 11 pages, 8 figures

Stochastic Optimization for Non-convex Problem with Inexact Hessian Matrix, Gradient, and Function

Oct 18, 2023

Liu Liu, Xuanqing Liu, Cho-Jui Hsieh, Dacheng Tao

Abstract: Trust-region (TR) and adaptive regularization using cubics (ARC) methods have proven to have some very appealing theoretical properties for non-convex optimization by concurrently computing the function value, gradient, and Hessian matrix to obtain the next search direction and the adjusted parameters. Although stochastic approximations help greatly reduce the computational cost, it is challenging to theoretically guarantee the convergence rate. In this paper, we explore a family of stochastic TR and ARC methods that can simultaneously provide inexact computations of the Hessian matrix, gradient, and function values. Our algorithms require much less propagation overhead per iteration than TR and ARC. We prove that the iteration complexity to achieve $\epsilon$-approximate second-order optimality is of the same order as that of the exact computations demonstrated in previous studies. Additionally, the mild conditions on inexactness can be met by leveraging a random sampling technique in the finite-sum minimization problem. Numerical experiments with a non-convex problem support these findings and demonstrate that, with the same or a similar number of iterations, our algorithms require less computational overhead per iteration than current second-order methods.

* arXiv admin note: text overlap with arXiv:1809.09853

Label Disentanglement in Partition-based Extreme Multilabel Classification

Jun 24, 2021

Xuanqing Liu, Wei-Cheng Chang, Hsiang-Fu Yu, Cho-Jui Hsieh, Inderjit S. Dhillon

Abstract: Partition-based methods are increasingly used in extreme multi-label classification (XMC) problems due to their scalability to large output spaces (e.g., millions or more). However, existing methods partition the large label space into mutually exclusive clusters, which is sub-optimal when labels have multi-modality and rich semantics. For instance, the label "Apple" can be the fruit or the brand name, which leads to the following research question: can we disentangle these multi-modal labels with non-exclusive clustering tailored for downstream XMC tasks? In this paper, we show that the label assignment problem in partition-based XMC can be formulated as an optimization problem, with the objective of maximizing precision rates. This leads to an efficient algorithm to form flexible and overlapping label clusters, and a method that alternately optimizes the cluster assignments and the model parameters for partition-based XMC. Experimental results on synthetic and real datasets show that our method can successfully disentangle multi-modal labels, leading to state-of-the-art (SOTA) results on four XMC benchmarks.

How much progress have we made in neural network training? A New Evaluation Protocol for Benchmarking Optimizers

Oct 19, 2020

Yuanhao Xiong, Xuanqing Liu, Li-Cheng Lan, Yang You, Si Si, Cho-Jui Hsieh

Abstract: Many optimizers have been proposed for training deep neural networks, and they often have multiple hyperparameters, which makes it tricky to benchmark their performance. In this work, we propose a new benchmarking protocol to evaluate both end-to-end efficiency (training a model from scratch without knowing the best hyperparameters) and data-addition training efficiency (the previously selected hyperparameters are used for periodically re-training the model with newly collected data). For end-to-end efficiency, unlike previous work that assumes random hyperparameter tuning, which over-emphasizes the tuning time, we propose to evaluate with a bandit hyperparameter tuning strategy. A human study is conducted to show that our evaluation protocol matches human tuning behavior better than random search. For data-addition training, we propose a new protocol for assessing the hyperparameter sensitivity to data shift. We then apply the proposed benchmarking framework to 7 optimizers and various tasks, including computer vision, natural language processing, reinforcement learning, and graph mining. Our results show that there is no clear winner across all the tasks.

Improving the Speed and Quality of GAN by Adversarial Training

Aug 07, 2020

Jiachen Zhong, Xuanqing Liu, Cho-Jui Hsieh

Abstract: Generative adversarial networks (GANs) have shown remarkable results in image generation tasks. High-fidelity class-conditional GAN methods often rely on stabilization techniques that constrain the global Lipschitz continuity. Such regularization leads to less expressive models and slower convergence; other techniques, such as large-batch training, require unconventional computing power and are not widely accessible. In this paper, we develop an efficient algorithm, namely FastGAN (Free AdverSarial Training), to improve the speed and quality of GAN training based on the adversarial training technique. We benchmark our method on CIFAR10, a subset of ImageNet, and the full ImageNet datasets. We choose strong baselines such as SNGAN and SAGAN; the results demonstrate that our training algorithm can achieve better generation quality (in terms of the Inception score and Fréchet Inception distance) with less overall training time. Most notably, our training algorithm brings ImageNet training to the broader public by requiring only 2-4 GPUs.

Provably Robust Metric Learning

Jun 12, 2020

Lu Wang, Xuanqing Liu, Jinfeng Yi, Yuan Jiang, Cho-Jui Hsieh

Abstract: Metric learning is an important family of algorithms for classification and similarity search, but the robustness of learned metrics against small adversarial perturbations is less studied. In this paper, we show that existing metric learning algorithms, which focus on boosting the clean accuracy, can result in metrics that are less robust than the Euclidean distance. To overcome this problem, we propose a novel metric learning algorithm to find a Mahalanobis distance that is robust against adversarial perturbations, and the robustness of the resulting model is certifiable. Experimental results show that the proposed metric learning algorithm improves both certified robust errors and empirical robust errors (errors under adversarial attacks). Furthermore, unlike neural network defenses which usually encounter a trade-off between clean and robust errors, our method does not sacrifice clean errors compared with previous metric learning methods. Our code is available at https://github.com/wangwllu/provably_robust_metric_learning.
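
For readers unfamiliar with the term, the Mahalanobis distance named in the abstract above is conventionally written as follows. This is the standard textbook form, not notation taken from the paper; the symbols $x$, $y$, and $M$ are assumptions introduced here for illustration.

```latex
% Standard Mahalanobis distance, parameterized by a positive semidefinite matrix M.
% Metric learning in this family learns M from data.
d_M(x, y) = \sqrt{(x - y)^\top M \, (x - y)}, \qquad M \succeq 0
% Special case: M = I recovers the Euclidean distance \|x - y\|_2.
```

Setting $M = I$ gives the Euclidean distance, which is the baseline the abstract compares learned metrics against.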

Evaluations and Methods for Explanation through Robustness Analysis

May 31, 2020

Cheng-Yu Hsieh, Chih-Kuan Yeh, Xuanqing Liu, Pradeep Ravikumar, Seungyeon Kim, Sanjiv Kumar, Cho-Jui Hsieh

Abstract: Among multiple ways of interpreting a machine learning model, measuring the importance of a set of features tied to a prediction is probably one of the most intuitive ways to explain a model. In this paper, we establish the link between a set of features and a prediction with a new evaluation criterion, robustness analysis, which measures the minimum distortion distance of adversarial perturbation. By measuring the tolerance level for an adversarial attack, we can extract a set of features that provides the most robust support for a prediction, and can also extract a set of features that contrasts the current prediction with a target class by setting a targeted adversarial attack. By applying this methodology to various prediction tasks across multiple domains, we observe that the derived explanations indeed capture the significant feature set, both qualitatively and quantitatively.
