Kang Min Yoo

Memory-Efficient Fine-Tuning of Compressed Large Language Models via sub-4-bit Integer Quantization

May 23, 2023
Jeonghoon Kim, Jung Hyun Lee, Sungdong Kim, Joonsuk Park, Kang Min Yoo, Se Jung Kwon, Dongsoo Lee

Parameter-efficient fine-tuning (PEFT) methods have emerged to mitigate the prohibitive cost of fully fine-tuning large language models (LLMs). Nonetheless, the enormous size of LLMs impedes routine deployment. To address the issue, we present Parameter-Efficient and Quantization-aware Adaptation (PEQA), a novel quantization-aware PEFT technique that facilitates model compression and accelerates inference. PEQA operates through a dual-stage process: first, the parameter matrix of each fully connected layer is quantized into a matrix of low-bit integers and a scalar vector; then, the scalar vector is fine-tuned for each downstream task. This strategy compresses the model considerably, lowering inference latency upon deployment and reducing the overall memory required. At the same time, fast fine-tuning and efficient task switching become possible. In this way, PEQA offers the benefits of quantization while inheriting the advantages of PEFT. We compare PEQA with competitive baselines in comprehensive experiments ranging from natural language understanding to generation benchmarks, using large language models of up to 65 billion parameters, demonstrating PEQA's scalability, task-specific adaptation performance, and ability to follow instructions even in extremely low-bit settings.
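
A minimal PyTorch sketch of the dual-stage idea above, assuming a simple per-output-channel rounding scheme; the helper peqa_quantize and the PEQALinear module are illustrative names, not the authors' released implementation:

    import torch

    def peqa_quantize(weight: torch.Tensor, bits: int = 4):
        # Stage 1: factor a full-precision weight matrix into frozen low-bit
        # integers and one scalar per output channel (the trainable part).
        qmax = 2 ** (bits - 1) - 1
        scale = weight.abs().max(dim=1, keepdim=True).values / qmax
        int_weight = torch.clamp(torch.round(weight / scale), -qmax - 1, qmax)
        return int_weight.to(torch.int8), scale.squeeze(1)

    class PEQALinear(torch.nn.Module):
        def __init__(self, weight: torch.Tensor, bits: int = 4):
            super().__init__()
            int_w, scale = peqa_quantize(weight, bits)
            self.register_buffer("int_weight", int_w)   # frozen low-bit integers
            self.scale = torch.nn.Parameter(scale)      # Stage 2: fine-tune only this vector

        def forward(self, x):
            w = self.int_weight.float() * self.scale.unsqueeze(1)  # dequantize on the fly
            return x @ w.t()

    layer = PEQALinear(torch.randn(16, 32), bits=4)
    optimizer = torch.optim.AdamW([layer.scale], lr=1e-4)  # only the scale vector is updated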

* 9 pages, 2 figures, 8 tables 

Aligning Large Language Models through Synthetic Feedback

May 23, 2023
Sungdong Kim, Sanghwan Bae, Jamin Shin, Soyoung Kang, Donghyun Kwak, Kang Min Yoo, Minjoon Seo

Aligning large language models (LLMs) to human values has become increasingly important as it enables sophisticated steering of LLMs, e.g., making them follow given instructions while keeping them less toxic. However, it requires significant amounts of human demonstrations and feedback. Recently, open-sourced models have attempted to replicate the alignment learning process by distilling data from already-aligned LLMs like InstructGPT or ChatGPT. While this process reduces human effort, constructing these datasets depends heavily on the teacher models. In this work, we propose a novel framework for alignment learning with almost no human labor and no dependency on pre-aligned LLMs. First, we perform reward modeling (RM) with synthetic feedback by contrasting responses from vanilla LLMs of various sizes and prompts. Then, we use the RM to simulate high-quality demonstrations for training a supervised policy and to further optimize the model with reinforcement learning. Our resulting model, Aligned Language Model with Synthetic Training dataset (ALMoST), outperforms open-sourced models, including Alpaca, Dolly, and OpenAssistant, which are trained on the outputs of InstructGPT or human-annotated instructions. Our 7B model outperforms the 12-13B models in A/B tests judged by GPT-4, with a winning rate of about 75% on average.
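
The synthetic-feedback step can be pictured roughly as below: a response from a stronger configuration (larger model, better prompting) is assumed to be preferred over one from a weaker configuration, yielding ranking pairs for reward modeling with no human labels. The configuration names and the generate helper are placeholders, not the paper's actual setup:

    from dataclasses import dataclass

    @dataclass
    class Config:
        name: str
        rank: int  # lower rank = assumed-stronger configuration (bigger model, better prompt)

    CONFIGS = [Config("llm-30b-fewshot", 0), Config("llm-7b-fewshot", 1), Config("llm-7b-zeroshot", 2)]

    def generate(config: Config, prompt: str) -> str:
        # placeholder: sample a response from the corresponding vanilla LLM here
        return f"[{config.name} response to: {prompt}]"

    def synthetic_comparisons(prompt: str):
        # Build (chosen, rejected) pairs: stronger configurations are assumed preferred.
        responses = [(c.rank, generate(c, prompt)) for c in CONFIGS]
        pairs = []
        for i, (rank_i, resp_i) in enumerate(responses):
            for rank_j, resp_j in responses[i + 1:]:
                if rank_i < rank_j:
                    pairs.append({"prompt": prompt, "chosen": resp_i, "rejected": resp_j})
        return pairs

    print(synthetic_comparisons("How do I brew green tea?"))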

* Preprint, 9 pages (with 10 pages of supplementary) 

Probing Out-of-Distribution Robustness of Language Models with Parameter-Efficient Transfer Learning

Jan 30, 2023
Hyunsoo Cho, Choonghyun Park, Junyeop Kim, Hyuhng Joon Kim, Kang Min Yoo, Sang-goo Lee

As the size of pre-trained language models (PLMs) continues to increase, numerous parameter-efficient transfer learning (PETL) methods have been proposed recently to compensate for the tremendous cost of fine-tuning. Despite the impressive results achieved by large PLMs and various PETL methods on sundry benchmarks, it remains unclear whether they can effectively handle distributionally shifted inputs. In this study, we systematically explore how the ability to detect out-of-distribution (OOD) inputs changes as the size of the PLM grows or the transfer methods are altered. Specifically, we evaluate various PETL techniques, including fine-tuning, Adapter, LoRA, and prefix-tuning, on three different intent classification tasks, each using language models of different scales.
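
As one concrete way such a probe can be run, the sketch below scores OOD detection with the standard maximum-softmax-probability confidence and AUROC; the paper may use different detection scores, so treat this as a generic protocol rather than the study's exact evaluation:

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def msp_scores(logits: np.ndarray) -> np.ndarray:
        # Maximum softmax probability: a standard confidence score for OOD detection.
        z = logits - logits.max(axis=1, keepdims=True)
        probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
        return probs.max(axis=1)

    def ood_auroc(in_logits: np.ndarray, out_logits: np.ndarray) -> float:
        # AUROC for separating in-distribution inputs (label 1) from shifted ones (label 0).
        scores = np.concatenate([msp_scores(in_logits), msp_scores(out_logits)])
        labels = np.concatenate([np.ones(len(in_logits)), np.zeros(len(out_logits))])
        return roc_auc_score(labels, scores)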

* WIP 

Prompt-Augmented Linear Probing: Scaling Beyond The Limit of Few-shot In-Context Learners

Dec 28, 2022
Hyunsoo Cho, Hyuhng Joon Kim, Junyeob Kim, Sang-Woo Lee, Sang-goo Lee, Kang Min Yoo, Taeuk Kim

Through in-context learning (ICL), large-scale language models are effective few-shot learners without additional model fine-tuning. However, ICL performance does not scale well with the number of available training samples, as it is limited by the inherent input-length constraint of the underlying language model. Meanwhile, many studies have revealed that language models are also powerful feature extractors, allowing them to be utilized in a black-box manner and enabling the linear probing paradigm, where lightweight discriminators are trained on top of pre-extracted input representations. This paper proposes prompt-augmented linear probing (PALP), a hybrid of linear probing and ICL that leverages the best of both worlds. PALP inherits the scalability of linear probing and the ability of prompting to steer language models toward more meaningful representations by tailoring the input into a more comprehensible form. Through in-depth investigations on various datasets, we verify that PALP significantly enhances the input representations, closing the gap between ICL in the data-hungry scenario and fine-tuning in the data-abundant scenario with little training overhead, potentially making PALP a strong alternative in black-box scenarios.
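
A rough sketch of the PALP recipe under stated assumptions: wrap each input in a prompt template, extract a frozen LM representation, and fit a lightweight linear classifier on top. The template wording, the choice of gpt2, and the last-token feature are illustrative assumptions, not the paper's exact configuration:

    import torch
    from transformers import AutoModel, AutoTokenizer
    from sklearn.linear_model import LogisticRegression

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModel.from_pretrained("gpt2").eval()
    TEMPLATE = "Review: {text}\nSentiment:"   # assumed template; the paper's prompts may differ

    @torch.no_grad()
    def embed(texts):
        # Prompt-augment each input, then take the frozen LM's last-token representation.
        features = []
        for text in texts:
            ids = tokenizer(TEMPLATE.format(text=text), return_tensors="pt")
            hidden = model(**ids).last_hidden_state          # (1, seq_len, hidden_dim)
            features.append(hidden[0, -1].numpy())
        return features

    # Linear probing on the prompt-augmented representations.
    probe = LogisticRegression().fit(embed(["great movie", "terrible plot"]), [1, 0])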

* AAAI 2023 

Critic-Guided Decoding for Controlled Text Generation

Dec 21, 2022
Minbeom Kim, Hwanhee Lee, Kang Min Yoo, Joonsuk Park, Hwaran Lee, Kyomin Jung

Steering language generation towards objectives or away from undesired content has been a long-standing goal in utilizing language models (LMs). Recent work has demonstrated reinforcement learning and weighted decoding as effective approaches for achieving higher levels of control and generation quality, each with its own pros and cons. In this work, we propose a novel critic-guided decoding method for controlled language generation (CriticControl) that combines the strengths of reinforcement learning and weighted decoding. Specifically, we adopt the actor-critic framework to train an LM-steering critic from non-differentiable reward models. Similar to weighted decoding, our method freezes the language model and manipulates the output token distribution using the trained critic, improving training efficiency and stability. Evaluation of our method on three controlled generation tasks, namely topic control, sentiment control, and detoxification, shows that our approach generates more coherent and well-controlled text than previous methods. In addition, CriticControl demonstrates superior generalization ability in zero-shot settings. Human evaluation studies also corroborate our findings.
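
A minimal sketch of one critic-guided decoding step, assuming a frozen Hugging Face-style causal LM and a critic that returns a scalar value for a candidate continuation; the exponential reweighting rule here is an illustrative choice, not necessarily the paper's exact formulation:

    import torch
    import torch.nn.functional as F

    def critic_guided_step(lm, critic, input_ids, top_k=20, alpha=1.0):
        # One weighted-decoding step: the LM stays frozen; its top-k next-token
        # probabilities are reweighted by the critic's value estimates.
        with torch.no_grad():
            logits = lm(input_ids).logits[:, -1, :]                  # (1, vocab)
            probs = F.softmax(logits, dim=-1)
            topk_probs, topk_ids = probs.topk(top_k, dim=-1)
            # Assumed critic interface: critic(token_ids) -> scalar value estimate.
            values = torch.stack([
                critic(torch.cat([input_ids, tok.view(1, 1)], dim=1)) for tok in topk_ids[0]
            ]).view(1, -1)
        reweighted = topk_probs * torch.exp(alpha * values)          # boost tokens the critic favors
        reweighted = reweighted / reweighted.sum(dim=-1, keepdim=True)
        next_token = topk_ids.gather(-1, torch.multinomial(reweighted, 1))
        return torch.cat([input_ids, next_token], dim=1)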

* 11 pages, 6 figures 

AlphaTuning: Quantization-Aware Parameter-Efficient Adaptation of Large-Scale Pre-Trained Language Models

Oct 08, 2022
Se Jung Kwon, Jeonghoon Kim, Jeongin Bae, Kang Min Yoo, Jin-Hwa Kim, Baeseong Park, Byeongwook Kim, Jung-Woo Ha, Nako Sung, Dongsoo Lee

There is growing interest in adapting large-scale language models using parameter-efficient fine-tuning methods. However, accelerating the model itself and achieving better inference efficiency through model compression have not been thoroughly explored yet. Model compression can provide the benefits of reducing memory footprints, enabling low-precision computations, and ultimately achieving cost-effective inference. To combine parameter-efficient adaptation and model compression, we propose AlphaTuning, which consists of post-training quantization of the pre-trained language model and fine-tuning of only some parts of the quantized parameters for a target task. Specifically, AlphaTuning works by employing binary-coding quantization, which factorizes the full-precision parameters into binary parameters and a separate set of scaling factors. During the adaptation phase, the binary values are frozen for all tasks, while the scaling factors are fine-tuned for the downstream task. We demonstrate that AlphaTuning, when applied to GPT-2 and OPT, performs competitively with full fine-tuning on a variety of downstream tasks while achieving a >10x compression ratio under 4-bit quantization and a >1,000x reduction in the number of trainable parameters.
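
A small sketch of the idea under stated assumptions: greedy binary-coding quantization factorizes each weight matrix into frozen {-1, +1} codes and per-row scaling factors, and only the scaling factors stay trainable during adaptation. Module and function names are illustrative, not the authors' code:

    import torch

    def binary_coding_quantize(weight: torch.Tensor, num_bits: int = 3):
        # Greedy binary-coding quantization: W ~= sum_i alpha_i * B_i,
        # with B_i in {-1, +1} and one alpha_i per output row.
        residual = weight.clone()
        binaries, alphas = [], []
        for _ in range(num_bits):
            b = torch.sign(residual)
            b[b == 0] = 1.0
            alpha = (residual * b).mean(dim=1, keepdim=True)   # per-row scale
            binaries.append(b)
            alphas.append(alpha)
            residual = residual - alpha * b
        return torch.stack(binaries), torch.stack(alphas)

    class AlphaTunedLinear(torch.nn.Module):
        # Binary codes stay frozen; only the scaling factors are fine-tuned.
        def __init__(self, weight, num_bits=3):
            super().__init__()
            B, alpha = binary_coding_quantize(weight, num_bits)
            self.register_buffer("B", B)                       # frozen binary codes
            self.alpha = torch.nn.Parameter(alpha)             # trainable scaling factors

        def forward(self, x):
            w = (self.alpha * self.B).sum(dim=0)               # reconstruct the weight
            return x @ w.t()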

* Findings of EMNLP 2022 

Continuous Decomposition of Granularity for Neural Paraphrase Generation

Sep 16, 2022
Xiaodong Gu, Zhaowei Zhang, Sang-Woo Lee, Kang Min Yoo, Jung-Woo Ha

While Transformers have had significant success in paragraph generation, they treat sentences as linear sequences of tokens and often neglect their hierarchical information. Prior work has shown that decomposing the levels of granularity (e.g., word, phrase, or sentence) for input tokens produces substantial improvements, suggesting the possibility of enhancing Transformers via more fine-grained modeling of granularity. In this work, we propose a continuous decomposition of granularity for neural paraphrase generation (C-DNPG). To efficiently incorporate granularity into sentence encoding, C-DNPG introduces a granularity-aware attention (GA-Attention) mechanism that extends multi-head self-attention with: 1) a granularity head that automatically infers the hierarchical structure of a sentence by neurally estimating the granularity level of each input token; and 2) two novel attention masks, namely granularity resonance and granularity scope, to efficiently encode granularity into attention. Experiments on two benchmarks, Quora question pairs and Twitter URLs, show that C-DNPG outperforms baseline models by a remarkable margin and achieves state-of-the-art results on many metrics. Qualitative analysis reveals that C-DNPG effectively captures fine-grained levels of granularity.
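
A highly simplified, single-head sketch of how a granularity head and a resonance-style mask could plug into self-attention; the paper's actual GA-Attention masks are more elaborate, so this is only an illustration of the mechanism:

    import torch
    import torch.nn.functional as F

    class GranularityAwareAttention(torch.nn.Module):
        # Sketch: a granularity head predicts a level in [0, 1] per token, and
        # attention weights are modulated by a "resonance" term that favors
        # token pairs at similar granularity.
        def __init__(self, dim):
            super().__init__()
            self.qkv = torch.nn.Linear(dim, 3 * dim)
            self.granularity_head = torch.nn.Linear(dim, 1)

        def forward(self, x):                                    # x: (batch, seq, dim)
            q, k, v = self.qkv(x).chunk(3, dim=-1)
            g = torch.sigmoid(self.granularity_head(x))          # per-token granularity level
            resonance = 1.0 - (g - g.transpose(1, 2)).abs()      # close levels -> weight near 1
            attn = F.softmax(q @ k.transpose(1, 2) / q.size(-1) ** 0.5, dim=-1)
            attn = attn * resonance
            attn = attn / attn.sum(dim=-1, keepdim=True)
            return attn @ v

    out = GranularityAwareAttention(16)(torch.randn(2, 5, 16))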

* Accepted to be published in COLING 2022 

Self-Generated In-Context Learning: Leveraging Auto-regressive Language Models as a Demonstration Generator

Jun 16, 2022
Hyuhng Joon Kim, Hyunsoo Cho, Junyeob Kim, Taeuk Kim, Kang Min Yoo, Sang-goo Lee

Large-scale pre-trained language models (PLMs) are well known for being able to solve a task simply by conditioning on a few input-label pairs, dubbed demonstrations, in a prompt, without being explicitly tuned for the desired downstream task. Such a process (i.e., in-context learning), however, naturally leads to high reliance on demonstrations, which are usually selected from external datasets. In this paper, we propose self-generated in-context learning (SG-ICL), which generates demonstrations for in-context learning from the PLM itself to minimize reliance on external demonstrations. We conduct experiments on four different text classification tasks and show that SG-ICL significantly outperforms zero-shot learning and is generally worth approximately 0.6 gold training samples. Moreover, our generated demonstrations show more consistent performance with lower variance compared to randomly selected demonstrations from the training dataset.
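
The essential SG-ICL loop can be sketched as below: the PLM first writes one demonstration per label, and the test input is then classified by conditioning on those self-generated demonstrations. The templates, verbalizers, and the generate/score helpers are placeholders, not the paper's exact prompts:

    LABELS = ["positive", "negative"]

    def generate(prompt: str) -> str:
        # placeholder: sample a continuation from the PLM
        return "a self-generated example sentence"

    def score(prompt: str, continuation: str) -> float:
        # placeholder: log-likelihood of `continuation` given `prompt` under the PLM
        return 0.0

    def sg_icl_predict(test_input: str) -> str:
        # Step 1: the PLM writes its own demonstration for each label.
        demos = []
        for label in LABELS:
            demo = generate(f"Write a {label} movie review:\n")
            demos.append(f"Review: {demo}\nSentiment: {label}\n")
        # Step 2: condition on the self-generated demonstrations and pick the
        # label the PLM finds most likely for the test input.
        context = "".join(demos) + f"Review: {test_input}\nSentiment:"
        return max(LABELS, key=lambda label: score(context, " " + label))

    print(sg_icl_predict("an unforgettable, moving film"))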

* NAACL 2022 Workshop on Large-scale Pre-trained Language Models 