Xiaotao Gu

QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models

Sep 26, 2023
Yuhui Xu, Lingxi Xie, Xiaotao Gu, Xin Chen, Heng Chang, Hengheng Zhang, Zhensu Chen, Xiaopeng Zhang, Qi Tian

Recent years have witnessed rapid development of large language models (LLMs). Despite their strong abilities in many language-understanding tasks, the heavy computational burden largely restricts the application of LLMs, especially when one needs to deploy them onto edge devices. In this paper, we propose a quantization-aware low-rank adaptation (QA-LoRA) algorithm. The motivation lies in the imbalanced degrees of freedom of quantization and adaptation, and the solution is to use group-wise operators that increase the degrees of freedom of quantization while decreasing those of adaptation. QA-LoRA is easily implemented with a few lines of code, and it equips the original LoRA with two-fold abilities: (i) during fine-tuning, the LLM's weights are quantized (e.g., into INT4) to reduce time and memory usage; (ii) after fine-tuning, the LLM and auxiliary weights are naturally integrated into a quantized model without loss of accuracy. We apply QA-LoRA to the LLaMA and LLaMA2 model families and validate its effectiveness on different fine-tuning datasets and downstream scenarios. Code will be made available at https://github.com/yuhuixu1993/qa-lora.

* 16 pages 
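
To make the group-wise idea concrete, below is a minimal, hypothetical PyTorch sketch (not the authors' released code) of a low-rank adapter whose input side operates on channel groups rather than individual channels, which is one way the adaptation's degrees of freedom can be reduced to line up with group-wise quantization parameters; the class name, group count, and initialization are illustrative assumptions.

import torch
import torch.nn as nn

class GroupWiseLoRALinear(nn.Module):
    """Sketch of a LoRA-style adapter whose A matrix acts on input-channel groups."""
    def __init__(self, in_features, out_features, rank=16, num_groups=32):
        super().__init__()
        assert in_features % num_groups == 0
        self.num_groups = num_groups
        # Frozen base weight (in practice it would be quantized, e.g., to INT4).
        self.weight = nn.Parameter(torch.randn(out_features, in_features), requires_grad=False)
        # LoRA A sees only `num_groups` pooled inputs instead of all `in_features` channels.
        self.lora_A = nn.Parameter(torch.randn(rank, num_groups) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))

    def forward(self, x):
        base = x @ self.weight.t()
        # Average-pool input channels within each group: (..., in) -> (..., num_groups).
        pooled = x.view(*x.shape[:-1], self.num_groups, -1).mean(dim=-1)
        return base + pooled @ self.lora_A.t() @ self.lora_B.t()

With, say, in_features=4096 and num_groups=32, the adapter only learns per-group corrections, which is the kind of structure that makes it possible to fold the auxiliary weights into group-wise quantization parameters after fine-tuning, in the spirit of the abstract's claim of lossless integration.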

Towards AGI in Computer Vision: Lessons Learned from GPT and Large Language Models

Jun 14, 2023
Lingxi Xie, Longhui Wei, Xiaopeng Zhang, Kaifeng Bi, Xiaotao Gu, Jianlong Chang, Qi Tian

The AI community has been pursuing algorithms known as artificial general intelligence (AGI) that apply to any kind of real-world problem. Recently, chat systems powered by large language models (LLMs) have emerged and rapidly become a promising direction toward AGI in natural language processing (NLP), but the path towards AGI in computer vision (CV) remains unclear. One may attribute the dilemma to the fact that visual signals are more complex than language signals, yet we are interested in finding concrete reasons, as well as absorbing experiences from GPT and LLMs to solve the problem. In this paper, we start with a conceptual definition of AGI and briefly review how NLP solves a wide range of tasks via a chat system. The analysis inspires us to regard unification as the next important goal of CV. However, despite various efforts in this direction, CV is still far from a system like GPT that naturally integrates all tasks. We point out that the essential weakness of CV lies in lacking a paradigm to learn from environments, whereas NLP has accomplished this in the text world. We then imagine a pipeline that puts a CV algorithm (i.e., an agent) in world-scale, interactable environments, pre-trains it to predict future frames with respect to its actions, and then fine-tunes it with instructions to accomplish various tasks. We expect substantial research and engineering efforts to push the idea forward and scale it up, for which we share our perspectives on future research directions.

* 17 pages, 14 figures, technical report, expected to be updated in the near future 

Pipeline MoE: A Flexible MoE Implementation with Pipeline Parallelism

Apr 22, 2023
Xin Chen, Hengheng Zhang, Xiaotao Gu, Kaifeng Bi, Lingxi Xie, Qi Tian

The Mixture of Experts (MoE) model has become an important choice for large language models because of its scalability, offering sublinear computational complexity for training and inference. However, existing MoE models suffer from two critical drawbacks: 1) tremendous intra-node and inter-node communication overhead introduced by all-to-all dispatching and gathering, and 2) limited scalability of the backbone because data parallelism and expert parallelism are bound together when scaling along the expert dimension. In this paper, we systematically analyze these drawbacks from the perspective of training efficiency in parallel frameworks and propose a novel MoE architecture called Pipeline MoE (PPMoE) to tackle them. PPMoE builds expert parallelism on top of tensor parallelism and replaces communication-intensive all-to-all dispatching and gathering with simple tensor index slicing and an intra-node all-reduce. Moreover, thanks to its flexible parallel architecture, PPMoE can conveniently integrate pipeline parallelism to further scale the backbone. Extensive experiments show that PPMoE not only achieves a more than $1.75\times$ speedup compared to existing MoE architectures but also reaches $90\%$ of the throughput of its corresponding backbone model, which is $20\times$ smaller.

* A novel framework for MoE models. Work in progress 
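
As a rough, single-process illustration of routing by tensor index slicing instead of all-to-all communication (not the paper's implementation; the module layout and hard top-1 routing are assumptions), the PyTorch sketch below gathers each expert's tokens with index selection. In a real PPMoE-style setup the expert shards would live on different tensor-parallel ranks and their partial outputs would be combined with an intra-node all-reduce.

import torch
import torch.nn as nn

class IndexSliceMoE(nn.Module):
    """Toy MoE layer: tokens reach experts via index slicing, no all-to-all exchange."""
    def __init__(self, d_model=64, d_ff=256, num_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                      # x: (num_tokens, d_model)
        scores = self.router(x)                # (num_tokens, num_experts)
        top1 = scores.argmax(dim=-1)           # hard top-1 routing for simplicity
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            idx = (top1 == e).nonzero(as_tuple=True)[0]   # index slicing per expert
            if idx.numel():
                out[idx] = expert(x[idx])
        return out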

Pangu-Weather: A 3D High-Resolution Model for Fast and Accurate Global Weather Forecast

Nov 03, 2022
Kaifeng Bi, Lingxi Xie, Hengheng Zhang, Xin Chen, Xiaotao Gu, Qi Tian

In this paper, we present Pangu-Weather, a deep-learning-based system for fast and accurate global weather forecasting. For this purpose, we establish a data-driven environment by downloading $43$ years of hourly global weather data from the 5th generation of ECMWF reanalysis (ERA5) and train a few deep neural networks with about $256$ million parameters in total. The spatial resolution of the forecast is $0.25^\circ\times0.25^\circ$, comparable to the ECMWF Integrated Forecasting System (IFS). More importantly, for the first time, an AI-based method outperforms state-of-the-art numerical weather prediction (NWP) methods in terms of accuracy (latitude-weighted RMSE and ACC) for all factors (e.g., geopotential, specific humidity, wind speed, temperature, etc.) and all time ranges (from one hour to one week). There are two key strategies for improving prediction accuracy: (i) designing a 3D Earth-Specific Transformer (3DEST) architecture that formulates the height (pressure level) information into cubic data, and (ii) applying a hierarchical temporal aggregation algorithm to alleviate cumulative forecast errors. In deterministic forecasting, Pangu-Weather shows great advantages for short- to medium-range forecasts (i.e., forecast times ranging from one hour to one week). Pangu-Weather supports a wide range of downstream forecast scenarios, including extreme weather forecasting (e.g., tropical cyclone tracking) and large-member ensemble forecasting in real time. Pangu-Weather not only ends the debate on whether AI-based methods can surpass conventional NWP methods, but also reveals novel directions for improving deep-learning weather forecast systems.

* 19 pages, 13 figures: the first ever AI-based method that outperforms traditional numerical weather prediction methods 
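
The hierarchical temporal aggregation mentioned in the abstract can be pictured with the short Python sketch below: a hedged illustration, not the released code, in which the specific lead times and the greedy composition are assumptions based on the abstract's goal of reducing cumulative error by taking fewer autoregressive steps.

def hierarchical_forecast(state, target_hours, models):
    """models: dict mapping lead time in hours -> single-step forecast function."""
    remaining = target_hours
    for lead in sorted(models, reverse=True):      # try the largest lead time first
        while remaining >= lead:
            state = models[lead](state)            # one forward pass advances `lead` hours
            remaining -= lead
    assert remaining == 0, "target must be a multiple of the smallest lead time"
    return state

For example, with hypothetical models for 1-, 3-, 6-, and 24-hour lead times, a 31-hour forecast takes only three forward passes (24 + 6 + 1) instead of 31 chained one-hour steps, so fewer errors accumulate.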

UCPhrase: Unsupervised Context-aware Quality Phrase Tagging

May 28, 2021
Xiaotao Gu, Zihan Wang, Zhenyu Bi, Yu Meng, Liyuan Liu, Jiawei Han, Jingbo Shang

Identifying and understanding quality phrases from context is a fundamental task in text mining. The most challenging part of this task arguably lies in uncommon, emerging, and domain-specific phrases. The infrequent nature of these phrases significantly hurts the performance of phrase mining methods that rely on sufficient phrase occurrences in the input corpus. Context-aware tagging models, though not restricted by frequency, heavily rely on domain experts for either massive sentence-level gold labels or handcrafted gazetteers. In this work, we propose UCPhrase, a novel unsupervised context-aware quality phrase tagger. Specifically, we induce high-quality phrase spans as silver labels from consistently co-occurring word sequences within each document. Compared with typical context-agnostic distant supervision based on existing knowledge bases (KBs), our silver labels are rooted deeply in the input domain and context, and thus have unique advantages in preserving contextual completeness and capturing emerging, out-of-KB phrases. Training a conventional neural tagger on silver labels usually faces the risk of overfitting to phrase surface names. Alternatively, we observe that the contextualized attention maps generated from a transformer-based neural language model effectively reveal the connections between words in a surface-agnostic way. Therefore, we pair such attention maps with the silver labels to train a lightweight span prediction model, which can be applied to new input to recognize (unseen) quality phrases regardless of their surface names or frequency. Thorough experiments on various tasks and datasets, including corpus-level phrase ranking, document-level keyphrase extraction, and sentence-level phrase tagging, demonstrate the superiority of our design over state-of-the-art pre-trained, unsupervised, and distantly supervised methods.

* KDD 2021 
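
A hedged sketch of the silver-label idea follows (not the released UCPhrase code; the n-gram range, count threshold, and maximality test are illustrative assumptions): within a single document, word sequences that co-occur consistently are kept as candidate phrase spans, and shorter sequences subsumed by longer frequent ones are dropped.

from collections import Counter

def mine_silver_phrases(sentences, max_n=4, min_count=3):
    """sentences: list of tokenized sentences (lists of words) from one document."""
    counts = Counter()
    for sent in sentences:
        for n in range(2, max_n + 1):
            for i in range(len(sent) - n + 1):
                counts[tuple(sent[i:i + n])] += 1
    frequent = {g for g, c in counts.items() if c >= min_count}

    def contained(g):
        # True if g appears as a contiguous subsequence of a longer frequent n-gram.
        return any(
            len(h) > len(g) and any(h[i:i + len(g)] == g for i in range(len(h) - len(g) + 1))
            for h in frequent
        )

    return [g for g in frequent if not contained(g)]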

On the Transformer Growth for Progressive BERT Training

Oct 23, 2020
Xiaotao Gu, Liyuan Liu, Hongkun Yu, Jing Li, Chen Chen, Jiawei Han

As excessive pre-training costs raise the need to improve efficiency, considerable effort has been made to train BERT progressively: starting from an inferior but low-cost model and gradually increasing the computational complexity. Our objective is to advance the understanding of such Transformer growth and discover principles that guide progressive training. First, we find that, similar to network architecture selection, Transformer growth also favors compound scaling. Specifically, while existing methods only conduct network growth in a single dimension, we observe that it is beneficial to use compound growth operators and balance multiple dimensions (e.g., depth, width, and input length of the model). Moreover, we explore alternative growth operators in each dimension via controlled comparisons to give practical guidance for operator selection. In light of our analyses, the proposed method CompoundGrow speeds up BERT pre-training by 73.6% and 82.2% for the base and large models respectively, while achieving comparable performance. Code will be released for reproduction and future studies.
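
To illustrate what a compound growth schedule might look like (the stage settings and helper functions below are hypothetical, not the paper's configuration), each stage grows depth, width, and input length together rather than along a single dimension.

growth_schedule = [
    # (num_layers, hidden_size, max_seq_len): illustrative stages ending at a BERT-base-like size
    (6, 384, 128),
    (9, 512, 256),
    (12, 768, 512),
]

def train_progressively(build_model, grow_from, train_stage, schedule=growth_schedule):
    """build_model/grow_from/train_stage are user-supplied callables (assumed helpers)."""
    model = None
    for layers, hidden, seq_len in schedule:
        new_model = build_model(layers, hidden, seq_len)
        if model is not None:
            grow_from(new_model, model)   # copy/expand parameters from the smaller model
        model = train_stage(new_model, seq_len)
    return model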


Learning Collaborative Agents with Rule Guidance for Knowledge Graph Reasoning

May 01, 2020
Deren Lei, Gangrong Jiang, Xiaotao Gu, Kexuan Sun, Yuning Mao, Xiang Ren

Walk-based models have shown their unique advantages in knowledge graph (KG) reasoning by achieving state-of-the-art performance while allowing for explicit visualization of the decision sequence. However, the sparse reward signals offered by the KG during a traversal are often insufficient to guide a sophisticated reinforcement learning (RL) model. An alternative approach to KG reasoning is to use traditional symbolic methods (e.g., rule induction), which achieve high precision without learning but are hard to generalize due to the limitations of symbolic representation. In this paper, we propose to fuse these two paradigms to get the best of both worlds. Our method leverages high-quality rules generated by symbolic methods to provide reward supervision for walk-based agents. Because symbolic rules are structured around entity variables, we can separate our walk-based agent into two sub-agents, thus allowing for additional efficiency. Experiments on public datasets demonstrate that walk-based models can benefit significantly from rule guidance.
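
As a toy illustration of rule-based reward shaping (the reward values and the rule-confidence lookup are hypothetical, not the paper's exact formulation), the agent receives, on top of the sparse terminal reward from the KG, a bonus proportional to the confidence of any symbolic rule its relation path matches.

def shaped_reward(path, reached_target, rule_confidence, bonus_scale=0.5):
    """path: sequence of relations traversed; rule_confidence: dict rule body -> confidence."""
    terminal = 1.0 if reached_target else 0.0          # sparse reward from the KG
    rule_bonus = bonus_scale * rule_confidence.get(tuple(path), 0.0)
    return terminal + rule_bonus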


Generating Representative Headlines for News Stories

Feb 01, 2020
Xiaotao Gu, Yuning Mao, Jiawei Han, Jialu Liu, Hongkun Yu, You Wu, Cong Yu, Daniel Finnie, Jiaqi Zhai, Nicholas Zukoski

Millions of news articles are published online every day, which can be overwhelming for readers to follow. Grouping articles that report the same event into news stories is a common way of assisting readers in their news consumption. However, it remains a challenging research problem to efficiently and effectively generate a representative headline for each story. Automatic summarization of a document set has been studied for decades, while few studies have focused on generating representative headlines for a set of articles. Unlike summaries, which aim to capture the most information with the least redundancy, headlines aim to capture, in a short form, the information jointly shared by the story's articles, and to exclude information that is too specific to any individual article. In this work, we study the problem of generating representative headlines for news stories. We develop a distant supervision approach to train large-scale generation models without any human annotation. This approach centers on two technical components. First, we propose a multi-level pre-training framework that incorporates a massive unlabeled corpus with different quality-vs.-quantity balances at different levels. We show that models trained within this framework outperform those trained with a purely human-curated corpus. Second, we propose a novel self-voting-based article attention layer to extract salient information shared by multiple articles. We show that models incorporating this layer are robust to potential noise in news stories and outperform existing baselines with or without noise. We can further enhance our model by incorporating human labels, and we show that our distant supervision approach significantly reduces the demand for labeled data.

* WebConf 2020 (WWW 2020) 
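
A minimal sketch of what a self-voting article attention layer could look like follows (assumed shapes and scoring, not the production model): each article representation attends to its peers, and the attention an article receives from the others acts as a salience weight when pooling the story representation, which tends to down-weight outlier articles.

import torch
import torch.nn as nn

class SelfVotingAttention(nn.Module):
    """Toy self-voting layer over a story's article representations."""
    def __init__(self, d_model=256):
        super().__init__()
        self.query = nn.Linear(d_model, d_model)
        self.key = nn.Linear(d_model, d_model)

    def forward(self, articles):                         # articles: (num_articles, d_model)
        scores = self.query(articles) @ self.key(articles).t() / articles.size(-1) ** 0.5
        votes = scores.softmax(dim=-1)                   # each row: one article's votes
        salience = votes.mean(dim=0)                     # attention each article receives
        return (salience.unsqueeze(-1) * articles).sum(dim=0)   # pooled story representation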

Learning Dynamic Context Augmentation for Global Entity Linking

Sep 04, 2019
Xiyuan Yang, Xiaotao Gu, Sheng Lin, Siliang Tang, Yueting Zhuang, Fei Wu, Zhigang Chen, Guoping Hu, Xiang Ren

Despite the recent success of collective entity linking (EL) methods, these "global" inference methods may yield sub-optimal results when the "all-mention coherence" assumption breaks, and they often suffer from high computational cost at the inference stage due to the complex search space. In this paper, we propose a simple yet effective solution, called Dynamic Context Augmentation (DCA), for collective EL, which requires only one pass through the mentions in a document. DCA sequentially accumulates context information to make efficient, collective inference, and can work with different local EL models as a plug-and-enhance module. We explore both supervised and reinforcement learning strategies for learning the DCA model. Extensive experiments show the effectiveness of our model with different learning settings, base models, decision orders, and attention mechanisms.
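
A hedged sketch of the one-pass inference idea follows (the scoring functions and their linear combination are placeholders, not the learned model): mentions are linked in reading order, and every resolved entity is appended to a dynamic context that augments the scores of later candidates.

def link_mentions(mentions, candidates, local_score, coherence_score, alpha=0.5):
    """candidates: dict mention -> candidate entities; scores are assumed callables."""
    dynamic_context, links = [], []
    for m in mentions:
        best = max(
            candidates[m],
            key=lambda e: local_score(m, e) + alpha * coherence_score(e, dynamic_context),
        )
        links.append(best)
        dynamic_context.append(best)    # accumulated context for subsequent mentions
    return links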
