Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Tao Li

Learning from Response not Preference: A Stackelberg Approach for LLM Detoxification using Non-parallel Data

Oct 27, 2024

Xinhong Xie, Tao Li, Quanyan Zhu

Figure 1 for Learning from Response not Preference: A Stackelberg Approach for LLM Detoxification using Non-parallel Data

Figure 2 for Learning from Response not Preference: A Stackelberg Approach for LLM Detoxification using Non-parallel Data

Figure 3 for Learning from Response not Preference: A Stackelberg Approach for LLM Detoxification using Non-parallel Data

Figure 4 for Learning from Response not Preference: A Stackelberg Approach for LLM Detoxification using Non-parallel Data

Abstract:Text detoxification, a variant of style transfer tasks, finds useful applications in online social media. This work presents a fine-tuning method that only uses non-parallel data to turn large language models (LLM) into a detoxification rewritter. We model the fine-tuning process as a Stackelberg game between an LLM (leader) and a toxicity screener (follower), which is a binary style classifier (toxic or non-toxic). The LLM aims to align its preference according to the screener and generate paraphases passing the screening. The primary challenge of non-parallel data fine-tuning is incomplete preference. In the case of unsuccessful paraphrases, the classifier cannot establish a preference between the input and paraphrase, as they belong to the same toxic style. Hence, preference-alignment fine-tuning methods, such as direct preference optimization (DPO), no longer apply. To address the challenge of incomplete preference, we propose Stackelberg response optimization (SRO), adapted from DPO, to enable the LLM to learn from the follower's response. The gist is that SRO decreases the likelihood of generating the paraphrase if it fails the follower's screening while performing DPO on the pair of the toxic input and its paraphrase when the latter passes the screening. Experiments indicate that the SRO-fine-tunned LLM achieves satisfying performance comparable to state-of-the-art models regarding style accuracy, content similarity, and fluency. The overall detoxification performance surpasses other computing methods and matches the human reference. Additional empirical evidence suggests that SRO is sensitive to the screener's feedback, and a slight perturbation leads to a significant performance drop. We release the code and LLM models at \url{https://github.com/XXXinhong/Detoxification_LLM}.

Via

Access Paper or Ask Questions

Influential Language Data Selection via Gradient Trajectory Pursuit

Oct 22, 2024

Zhiwei Deng, Tao Li, Yang Li

Abstract:Curating a desirable dataset for training has been the core of building highly capable large language models (Touvron et al., 2023; Achiam et al., 2023; Team et al.,2024). Gradient influence scores (Pruthi et al., 2020; Xia et al., 2024) are shown to be correlated with model performance and are commonly used as the criterion for data selection. However, existing methods are built upon either individual sample rankings or inefficient matching process, leading to suboptimal performance or scaling up issues.In this paper, we propose Gradient Trajectory Pursuit (GTP), an algorithm that performs pursuit of gradient trajectories via jointly selecting data points under an L0-norm regularized objective. The proposed algorithm highlights: (1) joint selection instead of independent top-k selection, which automatically de-duplicates samples; (2) higher efficiency with compressive sampling processes, which can be further sped up using a distributed framework. In the experiments, we demonstrate the algorithm in both in-domain and target-domain selection benchmarks and show that it outperforms top-k selection and competitive algorithms consistently, for example, our algorithm chooses as low as 0.5% data to achieve full performance on the targeted instruction tuning tasks

Via

Access Paper or Ask Questions

Meta Stackelberg Game: Robust Federated Learning against Adaptive and Mixed Poisoning Attacks

Oct 22, 2024

Tao Li, Henger Li, Yunian Pan, Tianyi Xu, Zizhan Zheng, Quanyan Zhu

Figure 1 for Meta Stackelberg Game: Robust Federated Learning against Adaptive and Mixed Poisoning Attacks

Figure 2 for Meta Stackelberg Game: Robust Federated Learning against Adaptive and Mixed Poisoning Attacks

Figure 3 for Meta Stackelberg Game: Robust Federated Learning against Adaptive and Mixed Poisoning Attacks

Figure 4 for Meta Stackelberg Game: Robust Federated Learning against Adaptive and Mixed Poisoning Attacks

Abstract:Federated learning (FL) is susceptible to a range of security threats. Although various defense mechanisms have been proposed, they are typically non-adaptive and tailored to specific types of attacks, leaving them insufficient in the face of multiple uncertain, unknown, and adaptive attacks employing diverse strategies. This work formulates adversarial federated learning under a mixture of various attacks as a Bayesian Stackelberg Markov game, based on which we propose the meta-Stackelberg defense composed of pre-training and online adaptation. {The gist is to simulate strong attack behavior using reinforcement learning (RL-based attacks) in pre-training and then design meta-RL-based defense to combat diverse and adaptive attacks.} We develop an efficient meta-learning approach to solve the game, leading to a robust and adaptive FL defense. Theoretically, our meta-learning algorithm, meta-Stackelberg learning, provably converges to the first-order $\varepsilon$-meta-equilibrium point in $O(\varepsilon^{-2})$ gradient iterations with $O(\varepsilon^{-4})$ samples per iteration. Experiments show that our meta-Stackelberg framework performs superbly against strong model poisoning and backdoor attacks of uncertain and unknown types.

* This work has been submitted to the IEEE for possible publication

Via

Access Paper or Ask Questions

DeltaDQ: Ultra-High Delta Compression for Fine-Tuned LLMs via Group-wise Dropout and Separate Quantization

Oct 11, 2024

Yanfeng Jiang, Zelan Yang, Bohua Chen, Shen Li, Yong Li, Tao Li

Figure 1 for DeltaDQ: Ultra-High Delta Compression for Fine-Tuned LLMs via Group-wise Dropout and Separate Quantization

Figure 2 for DeltaDQ: Ultra-High Delta Compression for Fine-Tuned LLMs via Group-wise Dropout and Separate Quantization

Figure 3 for DeltaDQ: Ultra-High Delta Compression for Fine-Tuned LLMs via Group-wise Dropout and Separate Quantization

Figure 4 for DeltaDQ: Ultra-High Delta Compression for Fine-Tuned LLMs via Group-wise Dropout and Separate Quantization

Abstract:Large language models achieve exceptional performance on various downstream tasks through supervised fine-tuning. However, the diversity of downstream tasks and practical requirements makes deploying multiple full-parameter fine-tuned models challenging. Current methods that compress the delta weight struggle to achieve ultra-high compression, failing to minimize the deployment overhead. To address the above issue, we propose a novel distribution-driven delta compression framework DeltaDQ, which utilizes Group-wise Dropout and Separate Quantization to achieve ultra-high compression for the delta weight. We have observed that the matrix-computed intermediate results for the delta weight exhibit extremely small variance and min-max range characteristics, referred to as Balanced Intermediate Results. Exploiting this phenomenon, we introduce Group-wise Dropout to perform dropout on the delta weight using an optimal group size. Furthermore, using Separate Quantization, sparse weights are quantized and decomposed to achieve a lower bit. Experimental results show that DeltaDQ achieves 16x compression with improved accuracy compared to baselines for WizardMath and WizardCoder models across different parameter scales. Moreover, DeltaDQ demonstrates the ability for ultra-high compression ratio, achieving 128x compression for the WizardMath-7B model and 512x compression for the WizardMath-70B model.

Via

Access Paper or Ask Questions

Unified Gradient-Based Machine Unlearning with Remain Geometry Enhancement

Sep 29, 2024

Zhehao Huang, Xinwen Cheng, JingHao Zheng, Haoran Wang, Zhengbao He, Tao Li, Xiaolin Huang

Figure 1 for Unified Gradient-Based Machine Unlearning with Remain Geometry Enhancement

Figure 2 for Unified Gradient-Based Machine Unlearning with Remain Geometry Enhancement

Figure 3 for Unified Gradient-Based Machine Unlearning with Remain Geometry Enhancement

Figure 4 for Unified Gradient-Based Machine Unlearning with Remain Geometry Enhancement

Abstract:Machine unlearning (MU) has emerged to enhance the privacy and trustworthiness of deep neural networks. Approximate MU is a practical method for large-scale models. Our investigation into approximate MU starts with identifying the steepest descent direction, minimizing the output Kullback-Leibler divergence to exact MU inside a parameters' neighborhood. This probed direction decomposes into three components: weighted forgetting gradient ascent, fine-tuning retaining gradient descent, and a weight saliency matrix. Such decomposition derived from Euclidean metric encompasses most existing gradient-based MU methods. Nevertheless, adhering to Euclidean space may result in sub-optimal iterative trajectories due to the overlooked geometric structure of the output probability space. We suggest embedding the unlearning update into a manifold rendered by the remaining geometry, incorporating second-order Hessian from the remaining data. It helps prevent effective unlearning from interfering with the retained performance. However, computing the second-order Hessian for large-scale models is intractable. To efficiently leverage the benefits of Hessian modulation, we propose a fast-slow parameter update strategy to implicitly approximate the up-to-date salient unlearning direction. Free from specific modal constraints, our approach is adaptable across computer vision unlearning tasks, including classification and generation. Extensive experiments validate our efficacy and efficiency. Notably, our method successfully performs class-forgetting on ImageNet using DiT and forgets a class on CIFAR-10 using DDPM in just 50 steps, compared to thousands of steps required by previous methods.

* Accepted by NeurIPS 2024 as a Spotlight paper

Via

Access Paper or Ask Questions

Emu3: Next-Token Prediction is All You Need

Sep 27, 2024

Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu(+15 more)

Figure 1 for Emu3: Next-Token Prediction is All You Need

Figure 2 for Emu3: Next-Token Prediction is All You Need

Figure 3 for Emu3: Next-Token Prediction is All You Need

Figure 4 for Emu3: Next-Token Prediction is All You Need

Abstract:While next-token prediction is considered a promising path towards artificial general intelligence, it has struggled to excel in multimodal tasks, which are still dominated by diffusion models (e.g., Stable Diffusion) and compositional approaches (e.g., CLIP combined with LLMs). In this paper, we introduce Emu3, a new suite of state-of-the-art multimodal models trained solely with next-token prediction. By tokenizing images, text, and videos into a discrete space, we train a single transformer from scratch on a mixture of multimodal sequences. Emu3 outperforms several well-established task-specific models in both generation and perception tasks, surpassing flagship models such as SDXL and LLaVA-1.6, while eliminating the need for diffusion or compositional architectures. Emu3 is also capable of generating high-fidelity video via predicting the next token in a video sequence. We simplify complex multimodal model designs by converging on a singular focus: tokens, unlocking great potential for scaling both during training and inference. Our results demonstrate that next-token prediction is a promising path towards building general multimodal intelligence beyond language. We open-source key techniques and models to support further research in this direction.

* Project Page: https://emu.baai.ac.cn

Via

Access Paper or Ask Questions

Flat-LoRA: Low-Rank Adaption over a Flat Loss Landscape

Sep 22, 2024

Tao Li, Zhengbao He, Yujun Li, Yasheng Wang, Lifeng Shang, Xiaolin Huang

Figure 1 for Flat-LoRA: Low-Rank Adaption over a Flat Loss Landscape

Figure 2 for Flat-LoRA: Low-Rank Adaption over a Flat Loss Landscape

Figure 3 for Flat-LoRA: Low-Rank Adaption over a Flat Loss Landscape

Figure 4 for Flat-LoRA: Low-Rank Adaption over a Flat Loss Landscape

Abstract:Fine-tuning large-scale pre-trained models is prohibitively expensive in terms of computational and memory costs. Low-Rank Adaptation (LoRA), a popular Parameter-Efficient Fine-Tuning (PEFT) method, provides an efficient way to fine-tune models by optimizing only a low-rank matrix. Despite recent progress made in improving LoRA's performance, the connection between the LoRA optimization space and the original full parameter space is often overlooked. A solution that appears flat in the LoRA space may exist sharp directions in the full parameter space, potentially harming generalization performance. In this paper, we propose Flat-LoRA, an efficient approach that seeks a low-rank adaptation located in a flat region of the full parameter space.Instead of relying on the well-established sharpness-aware minimization approach, which can incur significant computational and memory burdens, we utilize random weight perturbation with a Bayesian expectation loss objective to maintain training efficiency and design a refined perturbation generation strategy for improved performance. Experiments on natural language processing and image classification tasks with various architectures demonstrate the effectiveness of our approach.

* Work in progress

Via

Access Paper or Ask Questions

EDCSSM: Edge Detection with Convolutional State Space Model

Sep 03, 2024

Qinghui Hong, Haoyou Jiang, Pingdan Xiao, Sichun Du, Tao Li

Figure 1 for EDCSSM: Edge Detection with Convolutional State Space Model

Figure 2 for EDCSSM: Edge Detection with Convolutional State Space Model

Figure 3 for EDCSSM: Edge Detection with Convolutional State Space Model

Figure 4 for EDCSSM: Edge Detection with Convolutional State Space Model

Abstract:Edge detection in images is the foundation of many complex tasks in computer graphics. Due to the feature loss caused by multi-layer convolution and pooling architectures, learning-based edge detection models often produce thick edges and struggle to detect the edges of small objects in images. Inspired by state space models, this paper presents an edge detection algorithm which effectively addresses the aforementioned issues. The presented algorithm obtains state space variables of the image from dual-input channels with minimal down-sampling processes and utilizes these state variables for real-time learning and memorization of image text. Additionally, to achieve precise edges while filtering out false edges, a post-processing algorithm called wind erosion has been designed to handle the binary edge map. To further enhance the processing speed of the algorithm, we have designed parallel computing circuits for the most computationally intensive parts of presented algorithm, significantly improving computational speed and efficiency. Experimental results demonstrate that the proposed algorithm achieves precise thin edge localization and exhibits noise suppression capabilities across various types of images. With the parallel computing circuits, the algorithm to achieve processing speeds exceeds 30 FPS on 5K images.

Via

Access Paper or Ask Questions

Power Control and Random Serving Mode Allocation for CJT-NCJT Hybrid Mode Enabled Cell-Free Massive MIMO With Limited Fronthauls

Sep 03, 2024

Hangyu Zhang, Rui Zhang, Yongzhao Li, Yuhan Ruan, Tao Li, Dong Yang

Figure 1 for Power Control and Random Serving Mode Allocation for CJT-NCJT Hybrid Mode Enabled Cell-Free Massive MIMO With Limited Fronthauls

Figure 2 for Power Control and Random Serving Mode Allocation for CJT-NCJT Hybrid Mode Enabled Cell-Free Massive MIMO With Limited Fronthauls

Abstract:With a great potential of improving the service fairness and quality for user equipments (UEs), cell-free massive multiple-input multiple-output (mMIMO) has been regarded as an emerging candidate for 6G network architectures. Under ideal assumptions, the coherent joint transmission (CJT) serving mode has been considered as an optimal option for cell-free mMIMO systems, since it can achieve coherent cooperation gain among the access points. However, when considering the limited fronthaul constraint in practice, the non-coherent joint transmission (NCJT) serving mode is likely to outperform CJT, since the former requires much lower fronthaul resources. In other words, the performance excellence and worseness of single serving mode (CJT or NCJT) depends on the fronthaul capacity, and any single transmission mode cannot perfectly adapt the capacity limited fronthaul. To explore the performance potential of the cell-free mMIMO system with limited fronthauls by harnessing the merits of CJT and NCJT, we propose a CJT-NCJT hybrid serving mode framework, in which UEs are allocated to operate on CJT or NCJT serving mode. To improve the sum-rate of the system with low complexity, we first propose a probability-based random serving mode allocation scheme. With a given serving mode, a successive convex approximation-based power allocation algorithm is proposed to maximize the system's sum-rate. Simulation results demonstrate the superiority of the proposed scheme.

* 6 pages, 2 figures, accepted by GLOBECOM 2024

Via

Access Paper or Ask Questions

A Recursion-Based SNR Determination Method for Short Packet Transmission: Analysis and Applications

Aug 23, 2024

Chengzhe Yin, Rui Zhang, Yongzhao Li, Yuhan Ruan, Tao Li, Jiaheng Lu

Figure 1 for A Recursion-Based SNR Determination Method for Short Packet Transmission: Analysis and Applications

Figure 2 for A Recursion-Based SNR Determination Method for Short Packet Transmission: Analysis and Applications

Figure 3 for A Recursion-Based SNR Determination Method for Short Packet Transmission: Analysis and Applications

Figure 4 for A Recursion-Based SNR Determination Method for Short Packet Transmission: Analysis and Applications

Abstract:The short packet transmission (SPT) has gained much attention in recent years. In SPT, the most significant characteristic is that the finite blocklength code (FBC) is adopted. With FBC, the signal-to-noise ratio (SNR) cannot be expressed as an explicit function with respect to the other transmission parameters. This raises the following two problems for the resource allocation in SPTs: (i) The exact value of the SNR is hard to determine, and (ii) The property of SNR w.r.t. the other parameters is hard to analyze, which hinders the efficient optimization of them. To simultaneously tackle these problems, we have developed a recursion method in our prior work. To emphasize the significance of this method, we further analyze the convergence rate of the recursion method and investigate the property of the recursion function in this paper. Specifically, we first analyze the convergence rate of the recursion method, which indicates it can determine the SNR with low complexity. Then, we analyze the property of the recursion function, which facilitates the optimization of the other parameters during the recursion. Finally, we also enumerate some applications for the recursion method. Simulation results indicate that the recursion method converges faster than the other SNR determination methods. Besides, the results also show that the recursion-based methods can almost achieve the optimal solution of the application cases.

Via

Access Paper or Ask Questions