Tianyi Chen

School of Civil and Environmental Engineering, Nanyang Technological University, Singapore

Automatic Joint Structured Pruning and Quantization for Efficient Neural Network Training and Compression

Feb 23, 2025

Xiaoyi Qu, David Aponte, Colby Banbury, Daniel P. Robinson, Tianyu Ding, Kazuhito Koishida, Ilya Zharkov, Tianyi Chen

Abstract:Structured pruning and quantization are fundamental techniques used to reduce the size of deep neural networks (DNNs) and are typically applied independently. Applying these techniques jointly via co-optimization has the potential to produce smaller, high-quality models. However, existing joint schemes are not widely used because of (1) engineering difficulties (complicated multi-stage processes), (2) black-box optimization (extensive hyperparameter tuning to control the overall compression), and (3) insufficient architecture generalization. To address these limitations, we present the framework GETA, which automatically and efficiently performs joint structured pruning and quantization-aware training on any DNN. GETA introduces three key innovations: (i) a quantization-aware dependency graph (QADG) that constructs a pruning search space for generic quantization-aware DNNs, (ii) a partially projected stochastic gradient method that guarantees layerwise bit constraints are satisfied, and (iii) a new joint learning strategy that incorporates interpretable relationships between pruning and quantization. We present numerical experiments on both convolutional neural networks and transformer architectures showing that our approach achieves competitive (often superior) performance compared to existing joint pruning and quantization methods.
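As a rough illustration of item (ii) in the abstract above, the toy sketch below takes a gradient step on all variables and then projects only the bit widths onto a box constraint, so the bit bounds hold after every step. It is not GETA's algorithm: the function name `partially_projected_step`, the bounds `bit_min` and `bit_max`, and the plain clipping projection are assumptions made for illustration.

```python
def partially_projected_step(weights, bits, grad_w, grad_b, lr, bit_min, bit_max):
    """One gradient step on all variables; only the bit widths are projected.

    Hypothetical helper: the weights move freely, while each layer's bit
    width is clipped back into [bit_min, bit_max] after the step.
    """
    new_weights = [w - lr * g for w, g in zip(weights, grad_w)]
    stepped_bits = [b - lr * g for b, g in zip(bits, grad_b)]
    # Projection onto a box is coordinate-wise clipping.
    new_bits = [min(max(b, bit_min), bit_max) for b in stepped_bits]
    return new_weights, new_bits


# Two layers with deliberately large bit-width gradients, so that the
# unprojected step would leave the allowed range [2, 8].
weights, bits = partially_projected_step(
    weights=[0.5, -1.0], bits=[4.0, 8.0],
    grad_w=[1.0, -2.0], grad_b=[30.0, -30.0],
    lr=0.1, bit_min=2.0, bit_max=8.0,
)
print(weights, bits)
```

The weights are left unconstrained, which is what makes the projection "partial" in this toy setting.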

A First-order Generative Bilevel Optimization Framework for Diffusion Models

Feb 12, 2025

Quan Xiao, Hui Yuan, A F M Saif, Gaowen Liu, Ramana Kompella, Mengdi Wang, Tianyi Chen

Abstract:Diffusion models, which iteratively denoise data samples to synthesize high-quality outputs, have achieved empirical success across domains. However, optimizing these models for downstream tasks often involves nested bilevel structures, such as tuning hyperparameters for fine-tuning tasks or noise schedules in training dynamics, where traditional bilevel methods fail due to the infinite-dimensional probability space and prohibitive sampling costs. We formalize this challenge as a generative bilevel optimization problem and address two key scenarios: (1) fine-tuning pre-trained models via an inference-only lower-level solver paired with a sample-efficient gradient estimator for the upper level, and (2) training diffusion models from scratch with noise schedule optimization by reparameterizing the lower-level problem and designing a computationally tractable gradient estimator. Our first-order bilevel framework overcomes the incompatibility of conventional bilevel methods with diffusion processes, offering theoretical grounding and computational practicality. Experiments demonstrate that our method outperforms existing fine-tuning and hyperparameter search baselines.

ReTreever: Tree-based Coarse-to-Fine Representations for Retrieval

Feb 11, 2025

Shubham Gupta, Zichao Li, Tianyi Chen, Cem Subakan, Siva Reddy, Perouz Taslakian, Valentina Zantedeschi

Abstract:Document retrieval is a core component of question-answering systems, as it enables conditioning answer generation on new and large-scale corpora. While effective, the standard practice of encoding documents into high-dimensional embeddings for similarity search entails large memory and compute footprints, and also makes it hard to inspect the inner workings of the system. In this paper, we propose a tree-based method for organizing and representing reference documents at various granular levels, which offers the flexibility to balance cost and utility, and eases the inspection of the corpus content and retrieval operations. Our method, called ReTreever, jointly learns a routing function per internal node of a binary tree such that query and reference documents are assigned to similar tree branches, hence directly optimizing for retrieval performance. Our evaluations show that ReTreever generally preserves full representation accuracy. Its hierarchical structure further provides strong coarse representations and enhances transparency by indirectly learning meaningful semantic groupings. Among hierarchical retrieval methods, ReTreever achieves the best retrieval accuracy at the lowest latency, proving that this family of techniques can be viable in practical applications.
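The idea of a routing function at each internal node of a binary tree can be sketched as follows. This is a minimal illustration, not ReTreever's model: the linear-plus-sigmoid router, the heap-ordered `node_weights` layout, and the function name `leaf_distribution` are all assumptions. Each internal node sends an input right with some probability, and the probability of reaching a leaf is the product of the decisions along its path.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def leaf_distribution(x, node_weights, depth):
    """Probability of reaching each leaf of a complete binary tree.

    node_weights[i] is the (hypothetical) weight vector of internal node i
    in heap order: the root is 0 and the children of i are 2i+1 and 2i+2.
    """
    probs = [1.0]  # all probability mass starts at the root
    first = 0      # heap index of the first node on the current level
    for _ in range(depth):
        next_probs = []
        for offset, p in enumerate(probs):
            w = node_weights[first + offset]
            p_right = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            next_probs.extend([p * (1.0 - p_right), p * p_right])
        first += len(probs)
        probs = next_probs
    return probs


# Depth-2 tree: 3 internal nodes, 4 leaves.
node_weights = [[0.0, 0.0], [2.0, 0.0], [-2.0, 0.0]]
leaves = leaf_distribution([1.0, 0.0], node_weights, depth=2)
best_leaf = max(range(len(leaves)), key=lambda i: leaves[i])
print(leaves, best_leaf)
```

Truncating the loop at a smaller depth gives a coarser distribution over fewer nodes, which is one way to read the coarse-to-fine trade-off the abstract describes.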

Analog In-memory Training on General Non-ideal Resistive Elements: The Impact of Response Functions

Feb 10, 2025

Zhaoxian Wu, Quan Xiao, Tayfun Gokmen, Omobayode Fagbohungbe, Tianyi Chen

Abstract:As the economic and environmental costs of training and deploying large vision or language models increase dramatically, analog in-memory computing (AIMC) emerges as a promising energy-efficient solution. However, training on AIMC hardware, and especially its training dynamics, remains underexplored. In AIMC hardware, the trainable weights are represented by the conductance of resistive elements and updated using consecutive electrical pulses. Among all the physical properties of resistive elements, the response to the pulses directly affects the training dynamics. This paper first provides a theoretical foundation for gradient-based training on AIMC hardware and studies the impact of response functions. We demonstrate that noisy updates and asymmetric response functions negatively impact Analog SGD by imposing an implicit penalty term on the objective. Tiki-Taka, a residual learning algorithm, overcomes this issue by converging exactly to a critical point through bilevel optimization of a main array and a residual array. The conclusion is supported by simulations validating our theoretical insights.

Bilevel Joint Unsupervised and Supervised Training for Automatic Speech Recognition

Dec 11, 2024

Xiaodong Cui, A F M Saif, Songtao Lu, Lisha Chen, Tianyi Chen, Brian Kingsbury, George Saon

Abstract:In this paper, we propose a bilevel joint unsupervised and supervised training (BL-JUST) framework for automatic speech recognition. Compared to the conventional pre-training and fine-tuning strategy which is a disconnected two-stage process, BL-JUST tries to optimize an acoustic model such that it simultaneously minimizes both the unsupervised and supervised loss functions. Because BL-JUST seeks matched local optima of both loss functions, acoustic representations learned by the acoustic model strike a good balance between being generic and task-specific. We solve the BL-JUST problem using penalty-based bilevel gradient descent and evaluate the trained deep neural network acoustic models on various datasets with a variety of architectures and loss functions. We show that BL-JUST can outperform the widely-used pre-training and fine-tuning strategy and some other popular semi-supervised techniques.

* Accepted by IEEE/ACM Transactions on Audio, Speech and Language Processing
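The penalty-based bilevel gradient descent mentioned in the abstract above can be illustrated on a toy problem. The sketch below is not the BL-JUST implementation: the two quadratic losses, the fixed penalty weight `gamma`, and the step size are assumptions chosen so that the answer can be checked by hand. The "unsupervised" loss is minimized on the whole line x = y, and the penalty term pushes the "supervised" minimizer onto that line.

```python
def supervised_loss_grad(x, y):
    # Toy upper-level loss f(x, y) = (x - 2)^2 + y^2.
    return 2.0 * (x - 2.0), 2.0 * y


def unsupervised_loss_grad(x, y):
    # Toy lower-level loss g(x, y) = (x - y)^2, which attains its
    # minimum value 0 everywhere on the line x = y.
    return 2.0 * (x - y), -2.0 * (x - y)


def penalty_gradient_descent(gamma=100.0, lr=0.002, steps=5000):
    """Plain gradient descent on the penalized objective f + gamma * g."""
    x, y = 0.0, 0.0
    for _ in range(steps):
        fx, fy = supervised_loss_grad(x, y)
        gx, gy = unsupervised_loss_grad(x, y)
        x -= lr * (fx + gamma * gx)
        y -= lr * (fy + gamma * gy)
    return x, y


x, y = penalty_gradient_descent()
print(x, y)
```

Minimizing f restricted to the line x = y gives x = y = 1, and the penalized solution approaches that point as `gamma` grows. A practical method would typically increase `gamma` over the course of training rather than fix it.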

FERERO: A Flexible Framework for Preference-Guided Multi-Objective Learning

Dec 02, 2024

Lisha Chen, AFM Saif, Yanning Shen, Tianyi Chen

Abstract:Finding specific preference-guided Pareto solutions that represent different trade-offs among multiple objectives is critical yet challenging in multi-objective problems. Existing methods are restrictive in preference definitions and/or their theoretical guarantees. In this work, we introduce a Flexible framEwork for pREfeRence-guided multi-Objective learning (FERERO) by casting it as a constrained vector optimization problem. Specifically, two types of preferences are incorporated into this formulation -- the relative preference defined by the partial ordering induced by a polyhedral cone, and the absolute preference defined by constraints that are linear functions of the objectives. To solve this problem, convergent algorithms are developed with both single-loop and stochastic variants. Notably, this is the first single-loop primal algorithm for constrained vector optimization to our knowledge. The proposed algorithms adaptively adjust to both constraint and objective values, eliminating the need to solve different subproblems at different stages of constraint satisfaction. Experiments on multiple benchmarks demonstrate the proposed method is very competitive in finding preference-guided optimal solutions. Code is available at https://github.com/lisha-chen/FERERO/.
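The partial ordering induced by a polyhedral cone can be made concrete with a small check. For a cone C = {d : A d >= 0}, a vector u precedes v when v - u lies in C. The sketch below is only an illustration of that definition, not code from the FERERO repository: the helper names `in_cone` and `precedes` and the example matrices are assumptions.

```python
def in_cone(A, d, tol=1e-12):
    """True if d lies in the polyhedral cone {d : A d >= 0}."""
    return all(sum(a * x for a, x in zip(row, d)) >= -tol for row in A)


def precedes(A, u, v):
    """u precedes v in the cone-induced order iff v - u is in the cone."""
    return in_cone(A, [vi - ui for ui, vi in zip(u, v)])


# With A = identity the cone is the nonnegative orthant, which recovers
# the usual componentwise (Pareto) order.
identity = [[1.0, 0.0], [0.0, 1.0]]
# A wider cone that tolerates a small loss in one objective in exchange
# for a larger gain in the other.
wider = [[2.0, 1.0], [1.0, 2.0]]

print(precedes(identity, [1.0, 2.0], [2.0, 3.0]))  # componentwise larger
print(precedes(identity, [1.0, 3.0], [2.0, 2.6]))  # second objective drops
print(precedes(wider, [1.0, 3.0], [2.0, 2.6]))     # accepted by the wider cone
```

Changing the rows of `A` changes which trade-offs count as improvements, which is how a cone can encode a relative preference among objectives.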

Primal-Dual Spectral Representation for Off-policy Evaluation

Oct 23, 2024

Yang Hu, Tianyi Chen, Na Li, Kai Wang, Bo Dai

Abstract:Off-policy evaluation (OPE) is one of the most fundamental problems in reinforcement learning (RL): estimating the expected long-term payoff of a given target policy using only experiences from another, potentially unknown, behavior policy. The distribution correction estimation (DICE) family of estimators has advanced the state of the art in OPE by breaking the curse of horizon. However, the major bottleneck of applying DICE estimators lies in the difficulty of solving the saddle-point optimization involved, especially with neural network implementations. In this paper, we tackle this challenge by establishing a linear representation of the value function and the stationary distribution correction ratio, i.e., the primal and dual variables in the DICE framework, using the spectral decomposition of the transition operator. Such a primal-dual representation not only bypasses the non-convex non-concave optimization in vanilla DICE, thereby enabling a computationally efficient algorithm, but also paves the way for more efficient utilization of historical data. We highlight that our algorithm, SpectralDICE, is the first to leverage a linear representation of the primal-dual variables that is both computationally and sample efficient; its performance is supported by a rigorous theoretical sample complexity guarantee and a thorough empirical evaluation on various benchmarks.

* 29 pages, 5 figures

Mitigating Forgetting in LLM Supervised Fine-Tuning and Preference Learning

Oct 20, 2024

Heshan Fernando, Han Shen, Parikshit Ram, Yi Zhou, Horst Samulowitz, Nathalie Baracaldo, Tianyi Chen

Abstract:Post-training of pre-trained LLMs, which typically consists of the supervised fine-tuning (SFT) stage and the preference learning (RLHF or DPO) stage, is crucial to effective and safe LLM applications. The widely adopted approach in post-training popular open-source LLMs is to sequentially perform SFT and RLHF/DPO. However, sequential training is sub-optimal in terms of the SFT and RLHF/DPO trade-off: the LLM gradually forgets the first stage's training while undergoing the second stage's training. We theoretically prove the sub-optimality of sequential post-training. Furthermore, we propose a practical joint post-training framework that has theoretical convergence guarantees and empirically outperforms the sequential post-training framework at a similar computational cost. Our code is available at https://github.com/heshandevaka/XRIGHT.

Pipeline Gradient-based Model Training on Analog In-memory Accelerators

Oct 19, 2024

Zhaoxian Wu, Quan Xiao, Tayfun Gokmen, Hsinyu Tsai, Kaoutar El Maghraoui, Tianyi Chen

Abstract:Aiming to accelerate the training of large deep neural networks (DNNs) in an energy-efficient way, the analog in-memory computing (AIMC) accelerator emerges as a solution with immense potential. In AIMC accelerators, trainable weights are kept in memory without needing to be moved between memory and processors during training, which substantially reduces overhead. However, although the in-memory feature enables efficient computation, it also constrains the use of data parallelism, since copying weights from one AIMC accelerator to another is expensive. To enable parallel training using AIMC, we propose synchronous and asynchronous pipeline parallelism for AIMC accelerators, inspired by pipelining in the digital domain. This paper provides a theoretical convergence guarantee for both synchronous and asynchronous pipelines in terms of both sampling and clock cycle complexity, which is non-trivial since the physical characteristics of AIMC accelerators lead to analog updates that suffer from asymmetric bias. Simulations of training DNNs on real datasets verify the efficiency of pipeline training.

Generalizable Humanoid Manipulation with Improved 3D Diffusion Policies

Oct 14, 2024

Yanjie Ze, Zixuan Chen, Wenhao Wang, Tianyi Chen, Xialin He, Ying Yuan, Xue Bin Peng, Jiajun Wu

Abstract:Humanoid robots capable of autonomous operation in diverse environments have long been a goal for roboticists. However, autonomous manipulation by humanoid robots has largely been restricted to one specific scene, primarily due to the difficulty of acquiring generalizable skills. Recent advances in 3D visuomotor policies, such as the 3D Diffusion Policy (DP3), have shown promise in extending these capabilities to wilder environments. However, 3D visuomotor policies often rely on camera calibration and point-cloud segmentation, which present challenges for deployment on mobile robots like humanoids. In this work, we introduce the Improved 3D Diffusion Policy (iDP3), a novel 3D visuomotor policy that eliminates these constraints by leveraging egocentric 3D visual representations. We demonstrate that iDP3 enables a full-sized humanoid robot to autonomously perform skills in diverse real-world scenarios, using only data collected in the lab. Videos are available at: https://humanoid-manipulation.github.io

* Project website: https://humanoid-manipulation.github.io
