Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jie Tang

Tony

StepMathAgent: A Step-Wise Agent for Evaluating Mathematical Processes through Tree-of-Error

Mar 13, 2025

Shu-Xun Yang, Cunxiang Wang, Yidong Wang, Xiaotao Gu, Minlie Huang, Jie Tang

Abstract:Evaluating mathematical capabilities is critical for assessing the overall performance of large language models (LLMs). However, existing evaluation methods often focus solely on final answers, resulting in highly inaccurate and uninterpretable evaluation outcomes, as well as their failure to assess proof or open-ended problems. To address these issues, we propose a novel mathematical process evaluation agent based on Tree-of-Error, called StepMathAgent. This agent incorporates four internal core operations: logical step segmentation, step scoring, score aggregation and error tree generation, along with four external extension modules: difficulty calibration, simplicity evaluation, completeness validation and format assessment. Furthermore, we introduce StepMathBench, a benchmark comprising 1,000 step-divided process evaluation instances, derived from 200 high-quality math problems grouped by problem type, subject category and difficulty level. Experiments on StepMathBench show that our proposed StepMathAgent outperforms all state-of-the-art methods, demonstrating human-aligned evaluation preferences and broad applicability to various scenarios. Our data and code are available at https://github.com/SHU-XUN/StepMathAgent.

Via

Access Paper or Ask Questions

CATANet: Efficient Content-Aware Token Aggregation for Lightweight Image Super-Resolution

Mar 10, 2025

Xin Liu, Jie Liu, Jie Tang, Gangshan Wu

Abstract:Transformer-based methods have demonstrated impressive performance in low-level visual tasks such as Image Super-Resolution (SR). However, its computational complexity grows quadratically with the spatial resolution. A series of works attempt to alleviate this problem by dividing Low-Resolution images into local windows, axial stripes, or dilated windows. SR typically leverages the redundancy of images for reconstruction, and this redundancy appears not only in local regions but also in long-range regions. However, these methods limit attention computation to content-agnostic local regions, limiting directly the ability of attention to capture long-range dependency. To address these issues, we propose a lightweight Content-Aware Token Aggregation Network (CATANet). Specifically, we propose an efficient Content-Aware Token Aggregation module for aggregating long-range content-similar tokens, which shares token centers across all image tokens and updates them only during the training phase. Then we utilize intra-group self-attention to enable long-range information interaction. Moreover, we design an inter-group cross-attention to further enhance global information interaction. The experimental results show that, compared with the state-of-the-art cluster-based method SPIN, our method achieves superior performance, with a maximum PSNR improvement of 0.33dB and nearly double the inference speed.

* Accepted by CVPR2025

Via

Access Paper or Ask Questions

AutoLUT: LUT-Based Image Super-Resolution with Automatic Sampling and Adaptive Residual Learning

Mar 03, 2025

Yuheng Xu, Shijie Yang, Xin Liu, Jie Liu, Jie Tang, Gangshan Wu

Abstract:In recent years, the increasing popularity of Hi-DPI screens has driven a rising demand for high-resolution images. However, the limited computational power of edge devices poses a challenge in deploying complex super-resolution neural networks, highlighting the need for efficient methods. While prior works have made significant progress, they have not fully exploited pixel-level information. Moreover, their reliance on fixed sampling patterns limits both accuracy and the ability to capture fine details in low-resolution images. To address these challenges, we introduce two plug-and-play modules designed to capture and leverage pixel information effectively in Look-Up Table (LUT) based super-resolution networks. Our method introduces Automatic Sampling (AutoSample), a flexible LUT sampling approach where sampling weights are automatically learned during training to adapt to pixel variations and expand the receptive field without added inference cost. We also incorporate Adaptive Residual Learning (AdaRL) to enhance inter-layer connections, enabling detailed information flow and improving the network's ability to reconstruct fine details. Our method achieves significant performance improvements on both MuLUT and SPF-LUT while maintaining similar storage sizes. Specifically, for MuLUT, we achieve a PSNR improvement of approximately +0.20 dB improvement on average across five datasets. For SPF-LUT, with more than a 50% reduction in storage space and about a 2/3 reduction in inference time, our method still maintains performance comparable to the original. The code is available at https://github.com/SuperKenVery/AutoLUT.

Via

Access Paper or Ask Questions

Linear Model of RIS-Aided High-Mobility Communication System

Feb 28, 2025

Shuaijun Li, Jie Tang, Beixiong Zheng, Xiaokai Song, Guixin Pan, Kai-Kit Wong

Figure 1 for Linear Model of RIS-Aided High-Mobility Communication System

Figure 2 for Linear Model of RIS-Aided High-Mobility Communication System

Figure 3 for Linear Model of RIS-Aided High-Mobility Communication System

Figure 4 for Linear Model of RIS-Aided High-Mobility Communication System

Abstract:Reconfigurable intelligent surface (RIS)-aided vehicle-to-everything (V2X) communication has emerged as a crucial solution for providing reliable data services to vehicles on the road. However, in delay-sensitive or high-mobility communications, the rapid movement of vehicles can lead to random scattering in the environment and time-selective fading in the channel. In view of this, we investigate in this paper an innovative linear model with low-complexity transmitter signal design and receiver detection methods, which boost stability in fast-fading environments and reduce channel training overhead. Specifically, considering the differences in hardware design and signal processing at the receiving end between uplink and downlink communication systems, distinct solutions are proposed. Accordingly, we first integrate the Rician channel introduced by the RIS with the corresponding signal processing algorithms to model the RIS-aided downlink communication system as a Doppler-robust linear model. Inspired by this property, we design a precoding scheme based on the linear model to reduce the complexity of precoding. Then, by leveraging the linear model and the large-scale antenna array at the base station (BS) side, we improve the linear model for the uplink communication system and derive its asymptotic performance in closed-form. Simulation results demonstrate the performance advantages of the proposed RIS-aided high-mobility communication system compared to other benchmark schemes.

* IEEE Transactions on Vehicular Technology, 25 February 2025
* accepted in IEEE Transactions on Vehicular Technology( Early Access )

Via

Access Paper or Ask Questions

LongSafety: Evaluating Long-Context Safety of Large Language Models

Feb 24, 2025

Yida Lu, Jiale Cheng, Zhexin Zhang, Shiyao Cui, Cunxiang Wang, Xiaotao Gu, Yuxiao Dong, Jie Tang, Hongning Wang, Minlie Huang

Figure 1 for LongSafety: Evaluating Long-Context Safety of Large Language Models

Figure 2 for LongSafety: Evaluating Long-Context Safety of Large Language Models

Figure 3 for LongSafety: Evaluating Long-Context Safety of Large Language Models

Figure 4 for LongSafety: Evaluating Long-Context Safety of Large Language Models

Abstract:As Large Language Models (LLMs) continue to advance in understanding and generating long sequences, new safety concerns have been introduced through the long context. However, the safety of LLMs in long-context tasks remains under-explored, leaving a significant gap in both evaluation and improvement of their safety. To address this, we introduce LongSafety, the first comprehensive benchmark specifically designed to evaluate LLM safety in open-ended long-context tasks. LongSafety encompasses 7 categories of safety issues and 6 user-oriented long-context tasks, with a total of 1,543 test cases, averaging 5,424 words per context. Our evaluation towards 16 representative LLMs reveals significant safety vulnerabilities, with most models achieving safety rates below 55%. Our findings also indicate that strong safety performance in short-context scenarios does not necessarily correlate with safety in long-context tasks, emphasizing the unique challenges and urgency of improving long-context safety. Moreover, through extensive analysis, we identify challenging safety issues and task types for long-context models. Furthermore, we find that relevant context and extended input sequences can exacerbate safety risks in long-context scenarios, highlighting the critical need for ongoing attention to long-context safety challenges. Our code and data are available at https://github.com/thu-coai/LongSafety.

Via

Access Paper or Ask Questions

A Stronger Mixture of Low-Rank Experts for Fine-Tuning Foundation Models

Feb 20, 2025

Mengyang Sun, Yihao Wang, Tao Feng, Dan Zhang, Yifan Zhu, Jie Tang

Abstract:In order to streamline the fine-tuning of foundation models, Low-Rank Adapters (LoRAs) have been substantially adopted across various fields, including instruction tuning and domain adaptation. The underlying concept of LoRA involves decomposing a full-rank matrix into the product of two lower-rank matrices, which reduces storage consumption and accelerates the training process. Furthermore, to address the limited expressive capacity of LoRA, the Mixture-of-Expert (MoE) has been introduced for incorporating multiple LoRA adapters. The integration of LoRA experts leads to a visible improvement across several downstream scenes. However, the mixture of LoRAs (MoE-LoRA) still exhibits its low robustness during tuning and inferring. Inspired by the Riemannian Preconditioners which train LoRA as a sub-space projector, we propose a new training strategy for MoE-LoRA, to stabilize and boost its feature learning procedure by multi-space projections. Examinations on SGD and AdamW optimizers demonstrate the effectiveness of our methodology. Source code is available at https://github.com/THUDM/MoELoRA_Riemannian.

Via

Access Paper or Ask Questions

DataSciBench: An LLM Agent Benchmark for Data Science

Feb 19, 2025

Dan Zhang, Sining Zhoubian, Min Cai, Fengzu Li, Lekang Yang, Wei Wang, Tianjiao Dong, Ziniu Hu, Jie Tang, Yisong Yue

Figure 1 for DataSciBench: An LLM Agent Benchmark for Data Science

Figure 2 for DataSciBench: An LLM Agent Benchmark for Data Science

Figure 3 for DataSciBench: An LLM Agent Benchmark for Data Science

Figure 4 for DataSciBench: An LLM Agent Benchmark for Data Science

Abstract:This paper presents DataSciBench, a comprehensive benchmark for evaluating Large Language Model (LLM) capabilities in data science. Recent related benchmarks have primarily focused on single tasks, easily obtainable ground truth, and straightforward evaluation metrics, which limits the scope of tasks that can be evaluated. In contrast, DataSciBench is constructed based on a more comprehensive and curated collection of natural and challenging prompts for uncertain ground truth and evaluation metrics. We develop a semi-automated pipeline for generating ground truth (GT) and validating evaluation metrics. This pipeline utilizes and implements an LLM-based self-consistency and human verification strategy to produce accurate GT by leveraging collected prompts, predefined task types, and aggregate functions (metrics). Furthermore, we propose an innovative Task - Function - Code (TFC) framework to assess each code execution outcome based on precisely defined metrics and programmatic rules. Our experimental framework involves testing 6 API-based models, 8 open-source general models, and 9 open-source code generation models using the diverse set of prompts we have gathered. This approach aims to provide a more comprehensive and rigorous evaluation of LLMs in data science, revealing their strengths and weaknesses. Experimental results demonstrate that API-based models outperform open-sourced models on all metrics and Deepseek-Coder-33B-Instruct achieves the highest score among open-sourced models. We release all code and data at https://github.com/THUDM/DataSciBench.

* 40 pages, 7 figures, 6 tables

Via

Access Paper or Ask Questions

HPSS: Heuristic Prompting Strategy Search for LLM Evaluators

Feb 18, 2025

Bosi Wen, Pei Ke, Yufei Sun, Cunxiang Wang, Xiaotao Gu, Jinfeng Zhou, Jie Tang, Hongning Wang, Minlie Huang

Figure 1 for HPSS: Heuristic Prompting Strategy Search for LLM Evaluators

Figure 2 for HPSS: Heuristic Prompting Strategy Search for LLM Evaluators

Figure 3 for HPSS: Heuristic Prompting Strategy Search for LLM Evaluators

Figure 4 for HPSS: Heuristic Prompting Strategy Search for LLM Evaluators

Abstract:Since the adoption of large language models (LLMs) for text evaluation has become increasingly prevalent in the field of natural language processing (NLP), a series of existing works attempt to optimize the prompts for LLM evaluators to improve their alignment with human judgment. However, their efforts are limited to optimizing individual factors of evaluation prompts, such as evaluation criteria or output formats, neglecting the combinatorial impact of multiple factors, which leads to insufficient optimization of the evaluation pipeline. Nevertheless, identifying well-behaved prompting strategies for adjusting multiple factors requires extensive enumeration. To this end, we comprehensively integrate 8 key factors for evaluation prompts and propose a novel automatic prompting strategy optimization method called Heuristic Prompting Strategy Search (HPSS). Inspired by the genetic algorithm, HPSS conducts an iterative search to find well-behaved prompting strategies for LLM evaluators. A heuristic function is employed to guide the search process, enhancing the performance of our algorithm. Extensive experiments across four evaluation tasks demonstrate the effectiveness of HPSS, consistently outperforming both human-designed evaluation prompts and existing automatic prompt optimization methods.

* 32 pages, 10 figures

Via

Access Paper or Ask Questions

RM-PoT: Reformulating Mathematical Problems and Solving via Program of Thoughts

Feb 18, 2025

Yu Zhang, Shujun Peng, Nengwu Wu, Xinhan Lin, Yang Hu, Jie Tang

Abstract:Recently, substantial advancements have been made in training language models to carry out step-by-step reasoning for solving intricate numerical reasoning tasks. Beyond the methods used to solve these problems, the structure and formulation of the problems themselves also play a crucial role in determining the performance of large language models. We observe that even small changes in the surface form of mathematical problems can have a profound impact on both the answer distribution and solve rate. This highlights the vulnerability of LLMs to surface-level variations, revealing its limited robustness when reasoning through complex problems. In this paper, we propose RM-PoT, a three-stage framework that integrates problem reformulation (RM), code-aided reasoning (PoT), and domain-aware few-shot learning to address these limitations. Our approach first reformulates the input problem into diverse surface forms to reduce structural bias, then retrieves five semantically aligned examples from a pre-constructed domain-specific question bank to provide contextual guidance, and finally generates executable Python code for precise computation.

Via

Access Paper or Ask Questions

Small Language Model Makes an Effective Long Text Extractor

Feb 11, 2025

Yelin Chen, Fanjin Zhang, Jie Tang

Figure 1 for Small Language Model Makes an Effective Long Text Extractor

Figure 2 for Small Language Model Makes an Effective Long Text Extractor

Figure 3 for Small Language Model Makes an Effective Long Text Extractor

Figure 4 for Small Language Model Makes an Effective Long Text Extractor

Abstract:Named Entity Recognition (NER) is a fundamental problem in natural language processing (NLP). However, the task of extracting longer entity spans (e.g., awards) from extended texts (e.g., homepages) is barely explored. Current NER methods predominantly fall into two categories: span-based methods and generation-based methods. Span-based methods require the enumeration of all possible token-pair spans, followed by classification on each span, resulting in substantial redundant computations and excessive GPU memory usage. In contrast, generation-based methods involve prompting or fine-tuning large language models (LLMs) to adapt to downstream NER tasks. However, these methods struggle with the accurate generation of longer spans and often incur significant time costs for effective fine-tuning. To address these challenges, this paper introduces a lightweight span-based NER method called SeNER, which incorporates a bidirectional arrow attention mechanism coupled with LogN-Scaling on the [CLS] token to embed long texts effectively, and comprises a novel bidirectional sliding-window plus-shaped attention (BiSPA) mechanism to reduce redundant candidate token-pair spans significantly and model interactions between token-pair spans simultaneously. Extensive experiments demonstrate that our method achieves state-of-the-art extraction accuracy on three long NER datasets and is capable of extracting entities from long texts in a GPU-memory-friendly manner. Code: https://github.com/THUDM/scholar-profiling/tree/main/sener

* AAAI'25, 9 pages, 1 appendix pages

Via

Access Paper or Ask Questions