Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Wei Gao

Reinforcement Learning Optimization for Large-Scale Learning: An Efficient and User-Friendly Scaling Library

Jun 06, 2025

Weixun Wang, Shaopan Xiong, Gengru Chen, Wei Gao, Sheng Guo, Yancheng He, Ju Huang, Jiaheng Liu, Zhendong Li, Xiaoyang Li(+31 more)

Abstract:We introduce ROLL, an efficient, scalable, and user-friendly library designed for Reinforcement Learning Optimization for Large-scale Learning. ROLL caters to three primary user groups: tech pioneers aiming for cost-effective, fault-tolerant large-scale training, developers requiring flexible control over training workflows, and researchers seeking agile experimentation. ROLL is built upon several key modules to serve these user groups effectively. First, a single-controller architecture combined with an abstraction of the parallel worker simplifies the development of the training pipeline. Second, the parallel strategy and data transfer modules enable efficient and scalable training. Third, the rollout scheduler offers fine-grained management of each sample's lifecycle during the rollout stage. Fourth, the environment worker and reward worker support rapid and flexible experimentation with agentic RL algorithms and reward designs. Finally, AutoDeviceMapping allows users to assign resources to different models flexibly across various stages.

* 16 pages

Via

Access Paper or Ask Questions

Sparse-to-Dense: A Free Lunch for Lossless Acceleration of Video Understanding in LLMs

May 25, 2025

Xuan Zhang, Cunxiao Du, Sicheng Yu, Jiawei Wu, Fengzhuo Zhang, Wei Gao, Qian Liu

Abstract:Due to the auto-regressive nature of current video large language models (Video-LLMs), the inference latency increases as the input sequence length grows, posing challenges for the efficient processing of video sequences that are usually very long. We observe that during decoding, the attention scores of most tokens in Video-LLMs tend to be sparse and concentrated, with only certain tokens requiring comprehensive full attention. Based on this insight, we introduce Sparse-to-Dense (StD), a novel decoding strategy that integrates two distinct modules: one leveraging sparse top-K attention and the other employing dense full attention. These modules collaborate to accelerate Video-LLMs without loss. The fast (sparse) model speculatively decodes multiple tokens, while the slow (dense) model verifies them in parallel. StD is a tuning-free, plug-and-play solution that achieves up to a 1.94$\times$ walltime speedup in video processing. It maintains model performance while enabling a seamless transition from a standard Video-LLM to a sparse Video-LLM with minimal code modifications.

Via

Access Paper or Ask Questions

ProMind-LLM: Proactive Mental Health Care via Causal Reasoning with Sensor Data

May 20, 2025

Xinzhe Zheng, Sijie Ji, Jiawei Sun, Renqi Chen, Wei Gao, Mani Srivastava

Abstract:Mental health risk is a critical global public health challenge, necessitating innovative and reliable assessment methods. With the development of large language models (LLMs), they stand out to be a promising tool for explainable mental health care applications. Nevertheless, existing approaches predominantly rely on subjective textual mental records, which can be distorted by inherent mental uncertainties, leading to inconsistent and unreliable predictions. To address these limitations, this paper introduces ProMind-LLM. We investigate an innovative approach integrating objective behavior data as complementary information alongside subjective mental records for robust mental health risk assessment. Specifically, ProMind-LLM incorporates a comprehensive pipeline that includes domain-specific pretraining to tailor the LLM for mental health contexts, a self-refine mechanism to optimize the processing of numerical behavioral data, and causal chain-of-thought reasoning to enhance the reliability and interpretability of its predictions. Evaluations of two real-world datasets, PMData and Globem, demonstrate the effectiveness of our proposed methods, achieving substantial improvements over general LLMs. We anticipate that ProMind-LLM will pave the way for more dependable, interpretable, and scalable mental health case solutions.

Via

Access Paper or Ask Questions

On the Learning with Augmented Class via Forests

May 14, 2025

Fan Xu, Wuyang Chen, Wei Gao

Abstract:Decision trees and forests have achieved successes in various real applications, most working with all testing classes known in training data. In this work, we focus on learning with augmented class via forests, where an augmented class may appear in testing data yet not in training data. We incorporate information of augmented class into trees' splitting, i.e., a new splitting criterion, called augmented Gini impurity, is introduced to exploit some unlabeled data from testing distribution. We then develop the approach named Learning with Augmented Class via Forests (LACForest), which constructs shallow forests based on the augmented Gini impurity and then splits forests with pseudo-labeled augmented instances for better performance. We also develop deep neural forests with a novel optimization objective based on our augmented Gini impurity, so as to utilize the representation power of neural networks for forests. Theoretically, we present the convergence analysis for augmented Gini impurity, and finally conduct experiments to verify the effectiveness of our approaches. The code is available at https://github.com/nju-xuf/LACForest/.

* Accepted by IJCAI 2025

Via

Access Paper or Ask Questions

Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving

Mar 31, 2025

Wei Gao, Xinyu Zhou, Peng Sun, Tianwei Zhang, Yonggang Wen

Abstract:Key-Value cache (\texttt{KV} \texttt{cache}) compression has emerged as a promising technique to optimize Large Language Model (LLM) serving. It primarily decreases the memory consumption of \texttt{KV} \texttt{cache} to reduce the computation cost. Despite the development of many compression algorithms, their applications in production environments are still not prevalent. In this paper, we revisit mainstream \texttt{KV} \texttt{cache} compression solutions from a practical perspective. Our contributions are three-fold. First, we comprehensively review existing algorithmic designs and benchmark studies for \texttt{KV} \texttt{cache} compression and identify missing pieces in their performance measurement, which could hinder their adoption in practice. Second, we empirically evaluate representative \texttt{KV} \texttt{cache} compression methods to uncover two key issues that affect the computational efficiency: (1) while compressing \texttt{KV} \texttt{cache} can reduce memory consumption, current implementations (e.g., FlashAttention, PagedAttention) do not optimize for production-level LLM serving, resulting in suboptimal throughput performance; (2) compressing \texttt{KV} \texttt{cache} may lead to longer outputs, resulting in increased end-to-end latency. We further investigate the accuracy performance of individual samples rather than the overall performance, revealing the intrinsic limitations in \texttt{KV} \texttt{cache} compression when handling specific LLM tasks. Third, we provide tools to shed light on future \texttt{KV} \texttt{cache} compression studies and facilitate their practical deployment in production. They are open-sourced in \href{https://github.com/LLMkvsys/rethink-kv-compression}{https://github.com/LLMkvsys/rethink-kv-compression}.

* 21 pages, 18 figures, published to MLSys2025

Via

Access Paper or Ask Questions

UniPCGC: Towards Practical Point Cloud Geometry Compression via an Efficient Unified Approach

Mar 24, 2025

Kangli Wang, Wei Gao

Abstract:Learning-based point cloud compression methods have made significant progress in terms of performance. However, these methods still encounter challenges including high complexity, limited compression modes, and a lack of support for variable rate, which restrict the practical application of these methods. In order to promote the development of practical point cloud compression, we propose an efficient unified point cloud geometry compression framework, dubbed as UniPCGC. It is a lightweight framework that supports lossy compression, lossless compression, variable rate and variable complexity. First, we introduce the Uneven 8-Stage Lossless Coder (UELC) in the lossless mode, which allocates more computational complexity to groups with higher coding difficulty, and merges groups with lower coding difficulty. Second, Variable Rate and Complexity Module (VRCM) is achieved in the lossy mode through joint adoption of a rate modulation module and dynamic sparse convolution. Finally, through the dynamic combination of UELC and VRCM, we achieve lossy compression, lossless compression, variable rate and complexity within a unified framework. Compared to the previous state-of-the-art method, our method achieves a compression ratio (CR) gain of 8.1\% on lossless compression, and a Bjontegaard Delta Rate (BD-Rate) gain of 14.02\% on lossy compression, while also supporting variable rate and variable complexity.

* Accepted to AAAI 2025

Via

Access Paper or Ask Questions

High-Precision Transformer-Based Visual Servoing for Humanoid Robots in Aligning Tiny Objects

Mar 06, 2025

Jialong Xue, Wei Gao, Yu Wang, Chao Ji, Dongdong Zhao, Shi Yan, Shiwu Zhang

Figure 1 for High-Precision Transformer-Based Visual Servoing for Humanoid Robots in Aligning Tiny Objects

Figure 2 for High-Precision Transformer-Based Visual Servoing for Humanoid Robots in Aligning Tiny Objects

Figure 3 for High-Precision Transformer-Based Visual Servoing for Humanoid Robots in Aligning Tiny Objects

Figure 4 for High-Precision Transformer-Based Visual Servoing for Humanoid Robots in Aligning Tiny Objects

Abstract:High-precision tiny object alignment remains a common and critical challenge for humanoid robots in real-world. To address this problem, this paper proposes a vision-based framework for precisely estimating and controlling the relative position between a handheld tool and a target object for humanoid robots, e.g., a screwdriver tip and a screw head slot. By fusing images from the head and torso cameras on a robot with its head joint angles, the proposed Transformer-based visual servoing method can correct the handheld tool's positional errors effectively, especially at a close distance. Experiments on M4-M8 screws demonstrate an average convergence error of 0.8-1.3 mm and a success rate of 93\%-100\%. Through comparative analysis, the results validate that this capability of high-precision tiny object alignment is enabled by the Distance Estimation Transformer architecture and the Multi-Perception-Head mechanism proposed in this paper.

* for associated video, see https://b23.tv/cklF7aK

Via

Access Paper or Ask Questions

LLM-Enhanced Multiple Instance Learning for Joint Rumor and Stance Detection with Social Context Information

Feb 13, 2025

Ruichao Yang, Jing Ma, Wei Gao, Hongzhan Lin

Abstract:The proliferation of misinformation, such as rumors on social media, has drawn significant attention, prompting various expressions of stance among users. Although rumor detection and stance detection are distinct tasks, they can complement each other. Rumors can be identified by cross-referencing stances in related posts, and stances are influenced by the nature of the rumor. However, existing stance detection methods often require post-level stance annotations, which are costly to obtain. We propose a novel LLM-enhanced MIL approach to jointly predict post stance and claim class labels, supervised solely by claim labels, using an undirected microblog propagation model. Our weakly supervised approach relies only on bag-level labels of claim veracity, aligning with multi-instance learning (MIL) principles. To achieve this, we transform the multi-class problem into multiple MIL-based binary classification problems. We then employ a discriminative attention layer to aggregate the outputs from these classifiers into finer-grained classes. Experiments conducted on three rumor datasets and two stance datasets demonstrate the effectiveness of our approach, highlighting strong connections between rumor veracity and expressed stances in responding posts. Our method shows promising performance in joint rumor and stance detection compared to the state-of-the-art methods.

* Accepted by ACM TIST

Via

Access Paper or Ask Questions

Content-Rich AIGC Video Quality Assessment via Intricate Text Alignment and Motion-Aware Consistency

Feb 06, 2025

Shangkun Sun, Xiaoyu Liang, Bowen Qu, Wei Gao

Figure 1 for Content-Rich AIGC Video Quality Assessment via Intricate Text Alignment and Motion-Aware Consistency

Figure 2 for Content-Rich AIGC Video Quality Assessment via Intricate Text Alignment and Motion-Aware Consistency

Figure 3 for Content-Rich AIGC Video Quality Assessment via Intricate Text Alignment and Motion-Aware Consistency

Figure 4 for Content-Rich AIGC Video Quality Assessment via Intricate Text Alignment and Motion-Aware Consistency

Abstract:The advent of next-generation video generation models like \textit{Sora} poses challenges for AI-generated content (AIGC) video quality assessment (VQA). These models substantially mitigate flickering artifacts prevalent in prior models, enable longer and complex text prompts and generate longer videos with intricate, diverse motion patterns. Conventional VQA methods designed for simple text and basic motion patterns struggle to evaluate these content-rich videos. To this end, we propose \textbf{CRAVE} (\underline{C}ontent-\underline{R}ich \underline{A}IGC \underline{V}ideo \underline{E}valuator), specifically for the evaluation of Sora-era AIGC videos. CRAVE proposes the multi-granularity text-temporal fusion that aligns long-form complex textual semantics with video dynamics. Additionally, CRAVE leverages the hybrid motion-fidelity modeling to assess temporal artifacts. Furthermore, given the straightforward prompts and content in current AIGC VQA datasets, we introduce \textbf{CRAVE-DB}, a benchmark featuring content-rich videos from next-generation models paired with elaborate prompts. Extensive experiments have shown that the proposed CRAVE achieves excellent results on multiple AIGC VQA benchmarks, demonstrating a high degree of alignment with human perception. All data and code will be publicly available at https://github.com/littlespray/CRAVE.

Via

Access Paper or Ask Questions

Rethinking stance detection: A theoretically-informed research agenda for user-level inference using language models

Feb 04, 2025

Prasanta Bhattacharya, Hong Zhang, Yiming Cao, Wei Gao, Brandon Siyuan Loh, Joseph J. P. Simons, Liang Ze Wong

Figure 1 for Rethinking stance detection: A theoretically-informed research agenda for user-level inference using language models

Abstract:Stance detection has emerged as a popular task in natural language processing research, enabled largely by the abundance of target-specific social media data. While there has been considerable research on the development of stance detection models, datasets, and application, we highlight important gaps pertaining to (i) a lack of theoretical conceptualization of stance, and (ii) the treatment of stance at an individual- or user-level, as opposed to message-level. In this paper, we first review the interdisciplinary origins of stance as an individual-level construct to highlight relevant attributes (e.g., psychological features) that might be useful to incorporate in stance detection models. Further, we argue that recent pre-trained and large language models (LLMs) might offer a way to flexibly infer such user-level attributes and/or incorporate them in modelling stance. To better illustrate this, we briefly review and synthesize the emerging corpus of studies on using LLMs for inferring stance, and specifically on incorporating user attributes in such tasks. We conclude by proposing a four-point agenda for pursuing stance detection research that is theoretically informed, inclusive, and practically impactful.

Via

Access Paper or Ask Questions