Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Tian Ding

RealCritic: Towards Effectiveness-Driven Evaluation of Language Model Critiques

Jan 24, 2025

Zhengyang Tang, Ziniu Li, Zhenyang Xiao, Tian Ding, Ruoyu Sun, Benyou Wang, Dayiheng Liu, Fei Huang, Tianyu Liu, Bowen Yu(+1 more)

Abstract:Critiques are important for enhancing the performance of Large Language Models (LLMs), enabling both self-improvement and constructive feedback for others by identifying flaws and suggesting improvements. However, evaluating the critique capabilities of LLMs presents a significant challenge due to the open-ended nature of the task. In this work, we introduce a new benchmark designed to assess the critique capabilities of LLMs. Unlike existing benchmarks, which typically function in an open-loop fashion, our approach employs a closed-loop methodology that evaluates the quality of corrections generated from critiques. Moreover, the benchmark incorporates features such as self-critique, cross-critique, and iterative critique, which are crucial for distinguishing the abilities of advanced reasoning models from more classical ones. We implement this benchmark using eight challenging reasoning tasks. We have several interesting findings. First, despite demonstrating comparable performance in direct chain-of-thought generation, classical LLMs significantly lag behind the advanced reasoning-based model o1-mini across all critique scenarios. Second, in self-critique and iterative critique settings, classical LLMs may even underperform relative to their baseline capabilities. We hope that this benchmark will serve as a valuable resource to guide future advancements. The code and data are available at \url{https://github.com/tangzhy/RealCritic}.

Via

Access Paper or Ask Questions

Enabling Scalable Oversight via Self-Evolving Critic

Jan 10, 2025

Zhengyang Tang, Ziniu Li, Zhenyang Xiao, Tian Ding, Ruoyu Sun, Benyou Wang, Dayiheng Liu, Fei Huang, Tianyu Liu, Bowen Yu(+1 more)

Abstract:Despite their remarkable performance, the development of Large Language Models (LLMs) faces a critical challenge in scalable oversight: providing effective feedback for tasks where human evaluation is difficult or where LLMs outperform humans. While there is growing interest in using LLMs for critique, current approaches still rely on human annotations or more powerful models, leaving the issue of enhancing critique capabilities without external supervision unresolved. We introduce SCRIT (Self-evolving CRITic), a framework that enables genuine self-evolution of critique abilities. Technically, SCRIT self-improves by training on synthetic data, generated by a contrastive-based self-critic that uses reference solutions for step-by-step critique, and a self-validation mechanism that ensures critique quality through correction outcomes. Implemented with Qwen2.5-72B-Instruct, one of the most powerful LLMs, SCRIT achieves up to a 10.3\% improvement on critique-correction and error identification benchmarks. Our analysis reveals that SCRIT's performance scales positively with data and model size, outperforms alternative approaches, and benefits critically from its self-validation component.

Via

Access Paper or Ask Questions

An Efficient Unsupervised Framework for Convex Quadratic Programs via Deep Unrolling

Dec 02, 2024

Linxin Yang, Bingheng Li, Tian Ding, Jianghua Wu, Akang Wang, Yuyi Wang, Jiliang Tang, Ruoyu Sun, Xiaodong Luo

Abstract:Quadratic programs (QPs) arise in various domains such as machine learning, finance, and control. Recently, learning-enhanced primal-dual hybrid gradient (PDHG) methods have shown great potential in addressing large-scale linear programs; however, this approach has not been extended to QPs. In this work, we focus on unrolling "PDQP", a PDHG algorithm specialized for convex QPs. Specifically, we propose a neural network model called "PDQP-net" to learn optimal QP solutions. Theoretically, we demonstrate that a PDQP-net of polynomial size can align with the PDQP algorithm, returning optimal primal-dual solution pairs. We propose an unsupervised method that incorporates KKT conditions into the loss function. Unlike the standard learning-to-optimize framework that requires optimization solutions generated by solvers, our unsupervised method adjusts the network weights directly from the evaluation of the primal-dual gap. This method has two benefits over supervised learning: first, it helps generate better primal-dual gap since the primal-dual gap is in the objective function; second, it does not require solvers. We show that PDQP-net trained in this unsupervised manner can effectively approximate optimal QP solutions. Extensive numerical experiments confirm our findings, indicating that using PDQP-net predictions to warm-start PDQP can achieve up to 45% acceleration on QP instances. Moreover, it achieves 14% to 31% acceleration on out-of-distribution instances.

Via

Access Paper or Ask Questions

MoFO: Momentum-Filtered Optimizer for Mitigating Forgetting in LLM Fine-Tuning

Jul 31, 2024

Yupeng Chen, Senmiao Wang, Zhihang Lin, Zeyu Qin, Yushun Zhang, Tian Ding, Ruoyu Sun

Abstract:Recently, large language models (LLMs) have demonstrated remarkable capabilities in a wide range of tasks. Typically, an LLM is pre-trained on large corpora and subsequently fine-tuned on task-specific datasets. However, during fine-tuning, LLMs may forget the knowledge acquired in the pre-training stage, leading to a decline in general capabilities. To address this issue, we propose a new fine-tuning algorithm termed Momentum-Filtered Optimizer (MoFO). The key idea of MoFO is to iteratively select and update the model parameters with the largest momentum magnitudes. Compared to full-parameter training, MoFO achieves similar fine-tuning performance while keeping parameters closer to the pre-trained model, thereby mitigating knowledge forgetting. Unlike most existing methods for forgetting mitigation, MoFO combines the following two advantages. First, MoFO does not require access to pre-training data. This makes MoFO particularly suitable for fine-tuning scenarios where pre-training data is unavailable, such as fine-tuning checkpoint-only open-source LLMs. Second, MoFO does not alter the original loss function. This could avoid impairing the model performance on the fine-tuning tasks. We validate MoFO through rigorous convergence analysis and extensive experiments, demonstrating its superiority over existing methods in mitigating forgetting and enhancing fine-tuning performance.

Via

Access Paper or Ask Questions

Adam-mini: Use Fewer Learning Rates To Gain More

Jun 26, 2024

Yushun Zhang, Congliang Chen, Ziniu Li, Tian Ding, Chenwei Wu, Yinyu Ye, Zhi-Quan Luo, Ruoyu Sun

Abstract:We propose Adam-mini, an optimizer that achieves on-par or better performance than AdamW with 45% to 50% less memory footprint. Adam-mini reduces memory by cutting down the learning rate resources in Adam (i.e., $1/\sqrt{v}$). We find that $\geq$ 90% of these learning rates in $v$ could be harmlessly removed if we (1) carefully partition the parameters into blocks following our proposed principle on Hessian structure; (2) assign a single but good learning rate to each parameter block. We further find that, for each of these parameter blocks, there exists a single high-quality learning rate that can outperform Adam, provided that sufficient resources are available to search it out. We then provide one cost-effective way to find good learning rates and propose Adam-mini. Empirically, we verify that Adam-mini performs on par or better than AdamW on various language models sized from 125M to 7B for pre-training, supervised fine-tuning, and RLHF. The reduced memory footprint of Adam-mini also alleviates communication overheads among GPUs and CPUs, thereby increasing throughput. For instance, Adam-mini achieves 49.6% higher throughput than AdamW when pre-training Llama2-7B on $2\times$ A800-80GB GPUs, which saves 33% wall-clock time for pre-training.

Via

Access Paper or Ask Questions

PDHG-Unrolled Learning-to-Optimize Method for Large-Scale Linear Programming

Jun 04, 2024

Bingheng Li, Linxin Yang, Yupeng Chen, Senmiao Wang, Qian Chen, Haitao Mao, Yao Ma, Akang Wang, Tian Ding, Jiliang Tang(+1 more)

Figure 1 for PDHG-Unrolled Learning-to-Optimize Method for Large-Scale Linear Programming

Figure 2 for PDHG-Unrolled Learning-to-Optimize Method for Large-Scale Linear Programming

Figure 3 for PDHG-Unrolled Learning-to-Optimize Method for Large-Scale Linear Programming

Figure 4 for PDHG-Unrolled Learning-to-Optimize Method for Large-Scale Linear Programming

Abstract:Solving large-scale linear programming (LP) problems is an important task in various areas such as communication networks, power systems, finance and logistics. Recently, two distinct approaches have emerged to expedite LP solving: (i) First-order methods (FOMs); (ii) Learning to optimize (L2O). In this work, we propose an FOM-unrolled neural network (NN) called PDHG-Net, and propose a two-stage L2O method to solve large-scale LP problems. The new architecture PDHG-Net is designed by unrolling the recently emerged PDHG method into a neural network, combined with channel-expansion techniques borrowed from graph neural networks. We prove that the proposed PDHG-Net can recover PDHG algorithm, thus can approximate optimal solutions of LP instances with a polynomial number of neurons. We propose a two-stage inference approach: first use PDHG-Net to generate an approximate solution, and then apply PDHG algorithm to further improve the solution. Experiments show that our approach can significantly accelerate LP solving, achieving up to a 3$\times$ speedup compared to FOMs for large-scale LP problems.

* Accepted by ICML 2024

Via

Access Paper or Ask Questions

Why Transformers Need Adam: A Hessian Perspective

Feb 26, 2024

Yushun Zhang, Congliang Chen, Tian Ding, Ziniu Li, Ruoyu Sun, Zhi-Quan Luo

Abstract:SGD performs worse than Adam by a significant margin on Transformers, but the reason remains unclear. In this work, we provide an explanation of SGD's failure on Transformers through the lens of Hessian: (i) Transformers are ``heterogeneous'': the Hessian spectrum across parameter blocks vary dramatically, a phenomenon we call ``block heterogeneity"; (ii) Heterogeneity hampers SGD: SGD performs badly on problems with block heterogeneity. To validate that heterogeneity hampers SGD, we check various Transformers, CNNs, MLPs, and quadratic problems, and find that SGD works well on problems without block heterogeneity but performs badly when the heterogeneity exists. Our initial theoretical analysis indicates that SGD fails because it applies one single learning rate for all blocks, which cannot handle the heterogeneity among blocks. The failure could be rescued if we could assign different learning rates across blocks, as designed in Adam.

Via

Access Paper or Ask Questions

Federated Learning with Lossy Distributed Source Coding: Analysis and Optimization

Apr 23, 2022

Huiyuan Yang, Tian Ding, Xiaojun Yuan

Figure 1 for Federated Learning with Lossy Distributed Source Coding: Analysis and Optimization

Figure 2 for Federated Learning with Lossy Distributed Source Coding: Analysis and Optimization

Figure 3 for Federated Learning with Lossy Distributed Source Coding: Analysis and Optimization

Figure 4 for Federated Learning with Lossy Distributed Source Coding: Analysis and Optimization

Abstract:Recently, federated learning (FL), which replaces data sharing with model sharing, has emerged as an efficient and privacy-friendly paradigm for machine learning (ML). A main challenge of FL is its huge uplink communication cost. In this paper, we tackle this challenge from an information-theoretic perspective. Specifically, we put forth a distributed source coding (DSC) framework for FL uplink, which unifies the encoding, transmission, and aggregation of the local updates as a lossy DSC problem, thus providing a systematic way to exploit the correlation between local updates to improve the uplink efficiency. Under this DSC-FL framework, we propose an FL uplink scheme based on the modified Berger-Tung coding (MBTC), which supports separate encoding and joint decoding by modifying the achievability scheme of the Berger-Tung inner bound. The achievable region of the MBTC-based uplink scheme is also derived. To unleash the potential of the MBTC-based FL scheme, we carry out a convergence analysis and then formulate a convergence rate maximization problem to optimize the parameters of MBTC. To solve this problem, we develop two algorithms, respectively for small- and large-scale FL systems, based on the majorization-minimization (MM) technique. Numerical results demonstrate the superiority of the MBTC-based FL scheme in terms of aggregation distortion, convergence performance, and communication cost, revealing the great potential of the DSC-FL framework.

Via

Access Paper or Ask Questions

The Global Landscape of Neural Networks: An Overview

Jul 02, 2020

Ruoyu Sun, Dawei Li, Shiyu Liang, Tian Ding, R Srikant

Figure 1 for The Global Landscape of Neural Networks: An Overview

Figure 2 for The Global Landscape of Neural Networks: An Overview

Figure 3 for The Global Landscape of Neural Networks: An Overview

Figure 4 for The Global Landscape of Neural Networks: An Overview

Abstract:One of the major concerns for neural network training is that the non-convexity of the associated loss functions may cause bad landscape. The recent success of neural networks suggests that their loss landscape is not too bad, but what specific results do we know about the landscape? In this article, we review recent findings and results on the global landscape of neural networks. First, we point out that wide neural nets may have sub-optimal local minima under certain assumptions. Second, we discuss a few rigorous results on the geometric properties of wide networks such as "no bad basin", and some modifications that eliminate sub-optimal local minima and/or decreasing paths to infinity. Third, we discuss visualization and empirical explorations of the landscape for practical neural nets. Finally, we briefly discuss some convergence results and their relation to landscape results.

* 16 pages. 8 figures

Via

Access Paper or Ask Questions

Sub-Optimal Local Minima Exist for Almost All Over-parameterized Neural Networks

Nov 04, 2019

Tian Ding, Dawei Li, Ruoyu Sun

Figure 1 for Sub-Optimal Local Minima Exist for Almost All Over-parameterized Neural Networks

Abstract:Does over-parameterization eliminate sub-optimal local minima for neural network problems? On one hand, existing positive results do not prove the claim, but often weaker claims. On the other hand, existing negative results have strong assumptions on the activation functions and/or data samples, causing a large gap with positive results. It was unclear before whether there is a clean answer of "yes" or "no". In this paper, we answer this question with a strong negative result. In particular, we prove that for deep and over-parameterized networks, sub-optimal local minima exist for generic input data samples and generic nonlinear activation. This is the setting widely studied in the global landscape of over-parameterized networks, thus our result corrects a possible misconception that "over-parameterization eliminates sub-optimal local-min". Our construction is based on fundamental optimization analysis, and thus rather principled.

* 31 pages. An early version was submitted to Optimization Online on October 4

Via

Access Paper or Ask Questions