Yashar Mehdad

LLM-QAT: Data-Free Quantization Aware Training for Large Language Models

May 29, 2023
Zechun Liu, Barlas Oguz, Changsheng Zhao, Ernie Chang, Pierre Stock, Yashar Mehdad, Yangyang Shi, Raghuraman Krishnamoorthi, Vikas Chandra

Several post-training quantization methods have been applied to large language models (LLMs), and have been shown to perform well down to 8-bits. We find that these methods break down at lower bit precision, and investigate quantization aware training for LLMs (LLM-QAT) to push quantization levels even further. We propose a data-free distillation method that leverages generations produced by the pre-trained model, which better preserves the original output distribution and allows quantizing any generative model independent of its training data, similar to post-training quantization methods. In addition to quantizing weights and activations, we also quantize the KV cache, which is critical for increasing throughput and supporting long sequence dependencies at current model sizes. We experiment with LLaMA models of sizes 7B, 13B, and 30B, at quantization levels down to 4-bits. We observe large improvements over training-free methods, especially in the low-bit settings.
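
To make the quantization-aware training mechanism concrete, here is a minimal, illustrative PyTorch sketch of symmetric fake quantization with a straight-through estimator, where a quantized student is trained to match a full-precision teacher's output distribution on random inputs standing in for model-generated data. It is not the LLM-QAT implementation; the 4-bit setting, the QuantLinear name, and the stand-in data are assumptions.

```python
import torch
import torch.nn as nn

def fake_quantize(x: torch.Tensor, num_bits: int = 4) -> torch.Tensor:
    """Symmetric per-tensor fake quantization with a straight-through estimator."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = x.detach().abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale
    # Straight-through estimator: forward uses q, backward sees the identity.
    return x + (q - x).detach()

class QuantLinear(nn.Linear):
    """Linear layer that fake-quantizes weights and activations during QAT."""
    def __init__(self, in_features, out_features, num_bits=4, bias=False):
        super().__init__(in_features, out_features, bias=bias)
        self.num_bits = num_bits

    def forward(self, x):
        w_q = fake_quantize(self.weight, self.num_bits)
        x_q = fake_quantize(x, self.num_bits)
        return nn.functional.linear(x_q, w_q, self.bias)

# Toy distillation loop: quantized "student" matches a full-precision "teacher"
# on generated inputs (a stand-in for the paper's data-free generations).
teacher = nn.Linear(16, 16, bias=False)
student = QuantLinear(16, 16, num_bits=4)
student.weight.data.copy_(teacher.weight.data)
opt = torch.optim.AdamW(student.parameters(), lr=1e-3)
for _ in range(10):
    x = torch.randn(8, 16)               # placeholder for sampled generations
    with torch.no_grad():
        target = teacher(x).softmax(-1)  # teacher's output distribution
    loss = nn.functional.kl_div(student(x).log_softmax(-1), target, reduction="batchmean")
    opt.zero_grad(); loss.backward(); opt.step()
```

In the paper's setting the same idea extends to activations and the KV cache of a LLaMA model; the sketch only shows a single linear layer.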

VideoOFA: Two-Stage Pre-Training for Video-to-Text Generation

May 04, 2023
Xilun Chen, Lili Yu, Wenhan Xiong, Barlas Oğuz, Yashar Mehdad, Wen-tau Yih

We propose a new two-stage pre-training framework for video-to-text generation tasks such as video captioning and video question answering: A generative encoder-decoder model is first jointly pre-trained on massive image-text data to learn fundamental vision-language concepts, and then adapted to video data in an intermediate video-text pre-training stage to learn video-specific skills such as spatio-temporal reasoning. As a result, our VideoOFA model achieves new state-of-the-art performance on four Video Captioning benchmarks, beating prior art by an average of 9.7 points in CIDEr score. It also outperforms existing models on two open-ended Video Question Answering datasets, showcasing its generalization capability as a universal video-to-text model.
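
A rough sketch of the two-stage idea, under my own assumptions rather than the VideoOFA code: a shared encoder-decoder first sees images as single-frame clips (stage 1), then multi-frame clips (stage 2), with frame features pooled over time before text generation. All module names and sizes below are placeholders.

```python
import torch
import torch.nn as nn

class TinyVideoToText(nn.Module):
    """Toy encoder-decoder: frame features are pooled over time, then decoded to text."""
    def __init__(self, vocab_size=1000, frame_dim=3 * 32 * 32, dim=64):
        super().__init__()
        self.frame_encoder = nn.Linear(frame_dim, dim)      # stand-in for a vision backbone
        self.embed = nn.Embedding(vocab_size, dim)
        self.decoder = nn.GRU(dim, dim, batch_first=True)   # stand-in for a text decoder
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, frames, tokens):
        # frames: [B, T, C, H, W]; images are passed as single-frame clips (T == 1).
        b, t = frames.shape[:2]
        feats = self.frame_encoder(frames.reshape(b, t, -1))  # [B, T, dim]
        visual = feats.mean(dim=1, keepdim=True)              # temporal pooling
        h, _ = self.decoder(self.embed(tokens) + visual)      # condition decoding on the clip
        return self.lm_head(h)

model = TinyVideoToText()
# Stage 1: image-text pre-training (images as 1-frame videos).
img_logits = model(torch.randn(2, 1, 3, 32, 32), torch.randint(0, 1000, (2, 8)))
# Stage 2: intermediate video-text pre-training (multi-frame clips).
vid_logits = model(torch.randn(2, 4, 3, 32, 32), torch.randint(0, 1000, (2, 8)))
```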

How to Train Your DRAGON: Diverse Augmentation Towards Generalizable Dense Retrieval

Feb 15, 2023
Sheng-Chieh Lin, Akari Asai, Minghan Li, Barlas Oguz, Jimmy Lin, Yashar Mehdad, Wen-tau Yih, Xilun Chen

Various techniques have been developed in recent years to improve dense retrieval (DR), such as unsupervised contrastive learning and pseudo-query generation. Existing DRs, however, often suffer from effectiveness tradeoffs between supervised and zero-shot retrieval, which some argue is due to limited model capacity. We contradict this hypothesis and show that a generalizable DR can be trained to achieve high accuracy in both supervised and zero-shot retrieval without increasing model size. In particular, we systematically examine the contrastive learning of DRs under the framework of Data Augmentation (DA). Our study shows that common DA practices, such as query augmentation with generative models and pseudo-relevance label creation using a cross-encoder, are often inefficient and sub-optimal. We hence propose a new DA approach with diverse queries and sources of supervision to progressively train a generalizable DR. As a result, DRAGON, our dense retriever trained with diverse augmentation, is the first BERT-base-sized DR to achieve state-of-the-art effectiveness in both supervised and zero-shot evaluations, and even competes with models using more complex late interaction (ColBERTv2 and SPLADE++).
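
For readers unfamiliar with dense-retriever training, the following is a minimal sketch of the contrastive setup (dual encoder, in-batch negatives) that the DA study builds on, with a schematic loop over two hypothetical supervision sources. It is an illustration, not DRAGON's training recipe or hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDualEncoder(nn.Module):
    """Toy dense retriever: one encoder tower for queries, one for passages."""
    def __init__(self, vocab=5000, dim=64):
        super().__init__()
        self.q_emb, self.p_emb = nn.EmbeddingBag(vocab, dim), nn.EmbeddingBag(vocab, dim)

    def encode_q(self, ids):  # ids: [B, L]
        return F.normalize(self.q_emb(ids), dim=-1)

    def encode_p(self, ids):
        return F.normalize(self.p_emb(ids), dim=-1)

def contrastive_loss(q, p, temperature=0.05):
    """In-batch negatives: each query's positive is the passage at the same index."""
    scores = q @ p.t() / temperature                     # [B, B]
    labels = torch.arange(q.size(0))
    return F.cross_entropy(scores, labels)

model = TinyDualEncoder()
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
# Two hypothetical supervision sources: human-labeled queries and generated (augmented) ones.
for source in ["human_labels", "generated_or_cropped_queries"]:
    q_ids = torch.randint(0, 5000, (16, 12))             # placeholder query tokens
    p_ids = torch.randint(0, 5000, (16, 64))             # placeholder positive passages
    loss = contrastive_loss(model.encode_q(q_ids), model.encode_p(p_ids))
    opt.zero_grad(); loss.backward(); opt.step()
```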

STRUDEL: Structured Dialogue Summarization for Dialogue Comprehension

Dec 24, 2022
Borui Wang, Chengcheng Feng, Arjun Nair, Madelyn Mao, Jai Desai, Asli Celikyilmaz, Haoran Li, Yashar Mehdad, Dragomir Radev

Abstractive dialogue summarization has long been viewed as an important standalone task in natural language processing, but no previous work has explored whether abstractive dialogue summarization can also be used to boost an NLP system's performance on other important dialogue comprehension tasks. In this paper, we propose a novel type of dialogue summarization task, STRUctured DiaLoguE Summarization (STRUDEL), that can help pre-trained language models to better understand dialogues and improve their performance on important dialogue comprehension tasks. We further collect human annotations of STRUDEL summaries over 400 dialogues and introduce a new STRUDEL dialogue comprehension modeling framework that integrates STRUDEL into a graph-neural-network-based dialogue reasoning module over transformer encoder language models to improve their dialogue comprehension abilities. In our empirical experiments on two important downstream dialogue comprehension tasks, dialogue question answering and dialogue response prediction, we show that our STRUDEL dialogue comprehension model can significantly improve the dialogue comprehension performance of transformer encoder language models.

* EMNLP 2022 
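
As a purely illustrative sketch (the actual STRUDEL summary schema and model are defined in the paper), the snippet below shows the general pattern the abstract describes: encode structured-summary entries as graph nodes, run one GCN-style message-passing step, and fuse the result with a dialogue encoding before a comprehension head. Every entry string, size, and module here is a placeholder.

```python
import torch
import torch.nn as nn

# Hypothetical structured-summary entries for one dialogue (placeholders, not the paper's schema).
entries = ["speaker A asks about refund policy",
           "speaker B explains the 30-day window",
           "agreement: A will return the item"]

dim = 64
text_encoder = nn.EmbeddingBag(5000, dim)             # stand-in for a transformer encoder
ids = torch.randint(0, 5000, (len(entries), 10))      # placeholder token ids (a real tokenizer would encode `entries`)
nodes = text_encoder(ids)                             # [N, dim] node features

# Fully connected graph over entries, then one message-passing step:
# H' = ReLU(normalize(A) @ H @ W), the generic GCN-style update.
n = nodes.size(0)
adj = torch.ones(n, n)
adj = adj / adj.sum(dim=-1, keepdim=True)
W = nn.Linear(dim, dim, bias=False)
updated = torch.relu(adj @ W(nodes))                  # [N, dim]

# Fuse graph-reasoned summary features with the dialogue encoding before a task head.
dialogue_vec = text_encoder(torch.randint(0, 5000, (1, 50)))              # [1, dim]
fused = torch.cat([dialogue_vec, updated.mean(0, keepdim=True)], dim=-1)  # [1, 2*dim]
answer_logit = nn.Linear(2 * dim, 1)(fused)           # e.g., dialogue QA / response scoring
```
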
Improving Faithfulness of Abstractive Summarization by Controlling Confounding Effect of Irrelevant Sentences

Dec 19, 2022
Asish Ghoshal, Arash Einolghozati, Ankit Arun, Haoran Li, Lili Yu, Yashar Mehdad, Scott Wen-tau Yih, Asli Celikyilmaz

Lack of factual correctness is an issue that still plagues state-of-the-art summarization systems despite their impressive progress on generating seemingly fluent summaries. In this paper, we show that factual inconsistency can be caused by irrelevant parts of the input text, which act as confounders. To that end, we leverage information-theoretic measures of causal effects to quantify the amount of confounding and precisely characterize how it affects summarization performance. Based on insights derived from our theoretical results, we design a simple multi-task model to control such confounding by leveraging human-annotated relevant sentences when available. Crucially, we give a principled characterization of data distributions where such confounding can be large, thereby necessitating the use of human-annotated relevant sentences to generate factual summaries. Our approach improves faithfulness scores by 20% over strong baselines on AnswerSumm (Fabbri et al., 2021), a conversation summarization dataset where lack of faithfulness is a significant issue due to the subjective nature of the task. Our best method achieves the highest faithfulness score while also achieving state-of-the-art results on standard metrics like ROUGE and METEOR. We corroborate these improvements through human evaluation.
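
A minimal sketch of what a multi-task objective of this kind can look like, assuming a toy encoder and decoder: a relevance head is supervised with human-annotated relevant sentences, and generation is conditioned on the relevant sentences only. This illustrates the general idea, not the paper's model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, vocab = 64, 5000
encoder = nn.EmbeddingBag(vocab, dim)                  # stand-in for a document encoder
relevance_head = nn.Linear(dim, 1)                     # predicts whether a sentence is relevant
lm_head = nn.Linear(dim, vocab)                        # stand-in for the summary decoder

def multitask_loss(sent_ids, relevant_mask, summary_ids, alpha=0.5):
    """Summarization loss + relevance loss on human-annotated relevant sentences."""
    sent_vecs = encoder(sent_ids)                       # [num_sents, dim]
    rel_loss = F.binary_cross_entropy_with_logits(
        relevance_head(sent_vecs).squeeze(-1), relevant_mask.float())
    # Condition generation on relevant sentences only, cutting the confounding path
    # from irrelevant input sentences to the summary.
    context = sent_vecs[relevant_mask].mean(0, keepdim=True)   # [1, dim]
    logits = lm_head(context).expand(summary_ids.size(0), -1)  # toy decoder: same logits per step
    gen_loss = F.cross_entropy(logits, summary_ids)
    return gen_loss + alpha * rel_loss

sent_ids = torch.randint(0, vocab, (6, 20))             # six input sentences
relevant_mask = torch.tensor([1, 0, 1, 0, 0, 1], dtype=torch.bool)
summary_ids = torch.randint(0, vocab, (10,))            # ten-token reference summary
loss = multitask_loss(sent_ids, relevant_mask, summary_ids)
loss.backward()
```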

CITADEL: Conditional Token Interaction via Dynamic Lexical Routing for Efficient and Effective Multi-Vector Retrieval

Nov 18, 2022
Minghan Li, Sheng-Chieh Lin, Barlas Oguz, Asish Ghoshal, Jimmy Lin, Yashar Mehdad, Wen-tau Yih, Xilun Chen

Multi-vector retrieval methods combine the merits of sparse (e.g., BM25) and dense (e.g., DPR) retrievers and have achieved state-of-the-art performance on various retrieval tasks. These methods, however, are orders of magnitude slower and need much more space to store their indices compared to their single-vector counterparts. In this paper, we unify different multi-vector retrieval models from a token routing viewpoint and propose conditional token interaction via dynamic lexical routing, namely CITADEL, for efficient and effective multi-vector retrieval. CITADEL learns to route different token vectors to the predicted lexical "keys" such that a query token vector only interacts with document token vectors routed to the same key. This design significantly reduces the computation cost while maintaining high accuracy. Notably, CITADEL achieves the same or slightly better performance than the previous state of the art, ColBERT-v2, on both in-domain (MS MARCO) and out-of-domain (BEIR) evaluations, while being nearly 40 times faster. Code and data are available at https://github.com/facebookresearch/dpr-scale.
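
The routing idea can be sketched as follows, with the routing rule (top-k over a linear projection with ReLU-ed weights) taken as an assumption for illustration: document token vectors are bucketed by their predicted lexical keys, and each query token scores only against vectors in its own keys' buckets, with max-sim aggregation per document.

```python
import torch
from collections import defaultdict

torch.manual_seed(0)
dim, num_keys, topk = 32, 100, 2
router = torch.nn.Linear(dim, num_keys)        # predicts lexical "keys" per token vector

def route(vecs, k=topk):
    """Return, per token, its top-k key ids and the corresponding routing weights."""
    logits = router(vecs)
    weights, keys = logits.topk(k, dim=-1)
    return keys, torch.relu(weights)

# Index documents: per-key buckets of (doc_id, weighted token vector).
index = defaultdict(list)
docs = {0: torch.randn(5, dim), 1: torch.randn(7, dim)}    # toy token vectors per doc
for doc_id, tok_vecs in docs.items():
    keys, weights = route(tok_vecs)
    for t in range(tok_vecs.size(0)):
        for key, w in zip(keys[t].tolist(), weights[t]):
            index[key].append((doc_id, w * tok_vecs[t]))

# Query: each query token only interacts with document vectors routed to the same key,
# with max-sim aggregation per document (late-interaction style).
query = torch.randn(3, dim)
q_keys, q_weights = route(query)
scores = defaultdict(float)
for t in range(query.size(0)):
    for key, qw in zip(q_keys[t].tolist(), q_weights[t]):
        best = {}
        for doc_id, dvec in index.get(key, []):
            s = float((qw * query[t]) @ dvec)
            if s > best.get(doc_id, float("-inf")):
                best[doc_id] = s
        for doc_id, s in best.items():
            scores[doc_id] += s
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```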

Bridging the Training-Inference Gap for Dense Phrase Retrieval

Oct 25, 2022
Gyuwan Kim, Jinhyuk Lee, Barlas Oguz, Wenhan Xiong, Yizhe Zhang, Yashar Mehdad, William Yang Wang

Building dense retrievers requires a series of standard procedures, including training and validating neural models and creating indexes for efficient search. However, these procedures are often misaligned in that training objectives do not exactly reflect the retrieval scenario at inference time. In this paper, we explore how the gap between training and inference in dense retrieval can be reduced, focusing on dense phrase retrieval (Lee et al., 2021) where billions of representations are indexed at inference. Since validating every dense retriever with a large-scale index is practically infeasible, we propose an efficient way of validating dense retrievers using a small subset of the entire corpus. This allows us to validate various training strategies including unifying contrastive loss terms and using hard negatives for phrase retrieval, which largely reduces the training-inference discrepancy. As a result, we improve top-1 phrase retrieval accuracy by 2-3 points and top-20 passage retrieval accuracy by 2-4 points for open-domain question answering. Our work urges modeling dense retrievers with careful consideration of training and inference via efficient validation while advancing phrase retrieval as a general solution for dense retrieval.

* Findings of EMNLP 2022; 12 pages, 3 figures 
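
A small NumPy sketch of the subset-validation idea as I read it from the abstract: sample a sub-corpus, force the gold items into it, build the small index, and report top-k accuracy as a cheap proxy signal. The subset size and random encodings below are placeholders; absolute numbers from such a proxy are optimistic (fewer distractors), so it is most useful for comparing training strategies.

```python
import numpy as np

rng = np.random.default_rng(0)

def validate_on_subset(query_vecs, corpus_vecs, gold_ids, subset_size=10_000, k=20):
    """Proxy validation: retrieve from a sampled sub-corpus that keeps the gold items."""
    num_items = corpus_vecs.shape[0]
    sampled = rng.choice(num_items, size=min(subset_size, num_items), replace=False)
    keep = np.union1d(sampled, np.asarray(gold_ids))    # make sure gold items are indexable
    sub = corpus_vecs[keep]                             # small index instead of billions of vectors
    scores = query_vecs @ sub.T                         # [num_queries, |subset|]
    topk = keep[np.argsort(-scores, axis=1)[:, :k]]     # map back to original ids
    hits = [gold in row for gold, row in zip(gold_ids, topk)]
    return float(np.mean(hits))                         # top-k accuracy on the subset

# Toy usage with random vectors standing in for phrase/passage encodings.
queries = rng.standard_normal((32, 128)).astype(np.float32)
corpus = rng.standard_normal((50_000, 128)).astype(np.float32)
gold = rng.integers(0, 50_000, size=32)
print(validate_on_subset(queries, corpus, gold))
```
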
Structured Summarization: Unified Text Segmentation and Segment Labeling as a Generation Task

Sep 28, 2022
Hakan Inan, Rashi Rungta, Yashar Mehdad

Text segmentation aims to divide text into contiguous, semantically coherent segments, while segment labeling deals with producing labels for each segment. Past work has shown success in tackling segmentation and labeling for documents and conversations, using a combination of task-specific pipelines and supervised and unsupervised learning objectives. In this work, we propose a single encoder-decoder neural network that can handle long documents and conversations, trained simultaneously for both segmentation and segment labeling using only standard supervision. We successfully show a way to solve the combined task as a pure generation task, which we refer to as structured summarization. We apply the same technique to both document and conversational data, and we show state-of-the-art performance across datasets for both segmentation and labeling, under both high- and low-resource settings. Our results establish a strong case for considering text segmentation and segment labeling as a whole, and for moving towards general-purpose techniques that don't depend on domain expertise or task-specific components.
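
One way to picture "segmentation and labeling as generation" is with a serialized target string that a seq2seq model can emit and a parser can invert. The exact target format below (boundary sentence index plus label, pipe-separated) is my assumption for illustration, not necessarily the paper's.

```python
import re

def to_generation_target(segments):
    """Serialize (end_sentence_index, label) pairs into a single target string."""
    return " | ".join(f"{end} :: {label}" for end, label in segments)

def parse_generation(output, num_sentences):
    """Recover segment boundaries and labels from the generated string."""
    segments, start = [], 0
    for chunk in output.split("|"):
        m = re.match(r"\s*(\d+)\s*::\s*(.+?)\s*$", chunk)
        if not m:
            continue
        end = min(int(m.group(1)), num_sentences - 1)
        segments.append(((start, end), m.group(2)))
        start = end + 1
    return segments

# Toy document with 6 sentences split into two labeled segments.
target = to_generation_target([(2, "introduction"), (5, "experimental results")])
print(target)                                    # "2 :: introduction | 5 :: experimental results"
print(parse_generation(target, num_sentences=6)) # [((0, 2), 'introduction'), ((3, 5), 'experimental results')]
```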

Adapting Pretrained Text-to-Text Models for Long Text Sequences

Sep 21, 2022
Wenhan Xiong, Anchit Gupta, Shubham Toshniwal, Yashar Mehdad, Wen-tau Yih

We present an empirical study of adapting an existing pretrained text-to-text model for long-sequence inputs. Through a comprehensive study along three axes of the pretraining pipeline (model architecture, optimization objective, and pretraining corpus), we propose an effective recipe to build long-context models from existing short-context models. Specifically, we replace the full attention in transformers with pooling-augmented blockwise attention, and pretrain the model with a masked-span prediction task with spans of varying length. In terms of the pretraining corpus, we find that using randomly concatenated short documents from a large open-domain corpus results in better performance than using existing long-document corpora, which are typically limited in their domain coverage. With these findings, we build a long-context model that achieves competitive performance on long-text QA tasks and establishes the new state of the art on five long-text summarization datasets, often outperforming previous methods with larger model sizes.
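
The corpus finding is easy to illustrate: build long pretraining sequences by randomly concatenating short open-domain documents up to a target length, instead of relying on long-document corpora. The separator token, target length, and whitespace tokenization below are assumptions for the sketch.

```python
import random

random.seed(0)
SEP = "</doc>"          # assumed document separator token
TARGET_LEN = 8192       # assumed target sequence length in tokens

def build_long_sequences(short_docs, target_len=TARGET_LEN):
    """Randomly concatenate short documents until each sequence reaches target_len tokens."""
    docs = short_docs[:]
    random.shuffle(docs)
    sequences, current, current_len = [], [], 0
    for doc in docs:
        current.append(doc)
        current_len += len(doc.split()) + 1        # whitespace tokens as a stand-in for a tokenizer
        if current_len >= target_len:
            sequences.append(f" {SEP} ".join(current))
            current, current_len = [], 0
    if current:
        sequences.append(f" {SEP} ".join(current))
    return sequences

corpus = [f"short open-domain document number {i} " + "word " * random.randint(50, 400)
          for i in range(200)]
long_seqs = build_long_sequences(corpus)
print(len(long_seqs), len(long_seqs[0].split()))
```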

BiT: Robustly Binarized Multi-distilled Transformer

May 25, 2022
Zechun Liu, Barlas Oguz, Aasish Pappu, Lin Xiao, Scott Yih, Meng Li, Raghuraman Krishnamoorthi, Yashar Mehdad

Modern pre-trained transformers have rapidly advanced the state of the art in machine learning, but have also grown in parameters and computational complexity, making them increasingly difficult to deploy in resource-constrained environments. Binarization of the weights and activations of the network can significantly alleviate these issues, but is technically challenging from an optimization perspective. In this work, we identify a series of improvements that enable binary transformers at much higher accuracy than was previously possible. These include a two-set binarization scheme, a novel elastic binary activation function with learned parameters, and a method to quantize a network to its limit by successively distilling higher precision models into lower precision students. These approaches allow, for the first time, fully binarized transformer models at a practical level of accuracy, approaching a full-precision BERT baseline on the GLUE language understanding benchmark within as little as 5.9%.
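
As a minimal sketch of the basic mechanism (not BiT's elastic binary activation or its two-set scheme), the snippet below binarizes weights to a signed per-tensor scale with a straight-through estimator and distills a binary student from a full-precision teacher; the module names and single distillation step are assumptions. In the paper, this distillation is applied successively, higher-precision students teaching lower-precision ones.

```python
import torch
import torch.nn as nn

def binarize_ste(w: torch.Tensor) -> torch.Tensor:
    """Binarize to {-alpha, +alpha} with alpha = mean(|w|); straight-through gradient."""
    alpha = w.detach().abs().mean()
    b = alpha * torch.sign(w)
    return w + (b - w).detach()

class BinaryLinear(nn.Linear):
    def forward(self, x):
        return nn.functional.linear(x, binarize_ste(self.weight), self.bias)

# Multi-distillation, schematically: here just one step, full precision -> binary.
teacher = nn.Linear(32, 32)
student = BinaryLinear(32, 32)
student.load_state_dict(teacher.state_dict())
opt = torch.optim.AdamW(student.parameters(), lr=1e-3)
for _ in range(20):
    x = torch.randn(16, 32)
    loss = nn.functional.mse_loss(student(x), teacher(x).detach())
    opt.zero_grad(); loss.backward(); opt.step()
```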
