To mitigate the high inference latency stemming from autoregressive decoding in Large Language Models (LLMs), Speculative Decoding has emerged as a novel decoding paradigm for LLM inference. In each decoding step, this method first efficiently drafts several future tokens and then verifies them in parallel. Unlike autoregressive decoding, Speculative Decoding facilitates the simultaneous decoding of multiple tokens per step, thereby accelerating inference. This paper presents a comprehensive overview and analysis of this promising decoding paradigm. We begin by providing a formal definition and formulation of Speculative Decoding. Then, we organize in-depth discussions on its key facets, including current leading techniques, the challenges faced, and potential future directions in this field. We aim for this work to serve as a catalyst for further research on Speculative Decoding, ultimately contributing to more efficient LLM inference.
Recently, Pretrained Language Models (PLMs) have been serving as general-purpose interfaces, posing a significant demand for comprehensive visual knowledge. However, it remains unclear how well current PLMs and their visually augmented counterparts (VaLMs) can master visual commonsense knowledge. To investigate this, we propose ImageNetVC, a fine-grained, human-annotated dataset specifically designed for zero-shot visual commonsense evaluation across 1,000 ImageNet categories. Utilizing ImageNetVC, we delve into the fundamental visual commonsense knowledge of both unimodal PLMs and VaLMs, uncovering the scaling law and the influence of the backbone model on VaLMs. Furthermore, we investigate the factors affecting the visual commonsense knowledge of large-scale models, providing insights into the development of language models enriched with visual commonsense knowledge. Our code and dataset are available at https://github.com/hemingkx/ImageNetVC.
Pretrained language models have achieved remarkable success in a variety of natural language understanding tasks. Nevertheless, finetuning large pretrained models on downstream tasks is susceptible to overfitting if the training set is limited, which will lead to diminished performance. In this work, we propose a dynamic fine-tuning strategy for pretrained language models called Bi-Drop. It utilizes the gradient information of various sub-models generated by dropout to update the model parameters selectively. Experiments on the GLUE benchmark show that Bi-Drop outperforms previous fine-tuning methods by a considerable margin, and exhibits consistent superiority over vanilla fine-tuning across various pretrained models. Furthermore, empirical results indicate that Bi-Drop yields substantial improvements in the multiple task or domain transfer, data imbalance, and low-resource scenarios, demonstrating superb generalization ability and robustness.
* Work in progress. Co-first authors with equal contributions
Continual relation extraction (CRE) models aim at handling emerging new relations while avoiding catastrophically forgetting old ones in the streaming data. Though improvements have been shown by previous CRE studies, most of them only adopt a vanilla strategy when models first learn representations of new relations. In this work, we point out that there exist two typical biases after training of this vanilla strategy: classifier bias and representation bias, which causes the previous knowledge that the model learned to be shaded. To alleviate those biases, we propose a simple yet effective classifier decomposition framework that splits the last FFN layer into separated previous and current classifiers, so as to maintain previous knowledge and encourage the model to learn more robust representations at this training stage. Experimental results on two standard benchmarks show that our proposed framework consistently outperforms the state-of-the-art CRE models, which indicates that the importance of the first training stage to CRE models may be underestimated. Our code is available at https://github.com/hemingkx/CDec.
We study lossless acceleration for seq2seq generation with a novel decoding algorithm -- Aggressive Decoding. Unlike the previous efforts (e.g., non-autoregressive decoding) speeding up seq2seq generation at the cost of quality loss, our approach aims to yield the identical (or better) generation compared with autoregressive decoding but in a significant speedup, achieved by innovative cooperation of aggressive decoding and verification that are both efficient due to parallel computing. We propose two Aggressive Decoding paradigms for 2 kinds of seq2seq tasks: 1) For the seq2seq tasks whose inputs and outputs are highly similar (e.g., Grammatical Error Correction), we propose Input-guided Aggressive Decoding (IAD) that aggressively copies from the input sentence as drafted decoded tokens to verify in parallel; 2) For other general seq2seq tasks (e.g., Machine Translation), we propose Generalized Aggressive Decoding (GAD) that first employs an additional non-autoregressive decoding model for aggressive decoding and then verifies in parallel in the autoregressive manner. We test Aggressive Decoding on the most popular 6-layer Transformer model on GPU in multiple seq2seq tasks: 1) For IAD, we show that it can introduce a 7x-9x speedup for the Transformer in Grammatical Error Correction and Text Simplification tasks with the identical results as greedy decoding; 2) For GAD, we observe a 3x-5x speedup with the identical or even better quality in two important seq2seq tasks: Machine Translation and Abstractive Summarization. Moreover, Aggressive Decoding can benefit even more from stronger computing devices that are better at parallel computing. Given the lossless quality as well as significant and promising speedup, we believe Aggressive Decoding may potentially evolve into a de facto standard for efficient and lossless seq2seq generation in the near future.
* 24-page Microsoft Research Technical Report. Content overlap with
arXiv:2106.04970 and arXiv:2203.16487
In this paper, we propose Generalized Aggressive Decoding (GAD) -- a novel decoding paradigm for speeding up autoregressive translation without quality loss, through the collaboration of autoregressive and non-autoregressive translation (NAT) of the Transformer. At each decoding iteration, GAD aggressively decodes a number of tokens in parallel as a draft with NAT and then verifies them in the autoregressive manner, where only the tokens that pass the verification are kept as decoded tokens. GAD can achieve the same performance as autoregressive translation but much more efficiently because both NAT drafting and autoregressive verification are fast due to parallel computing. We conduct experiments in the WMT14 English-German translation task and confirm that the vanilla GAD yields exactly the same results as greedy decoding with an around 3x speedup, and that its variant (GAD++) with an advanced verification strategy not only outperforms the greedy translation and even achieves the comparable translation quality with the beam search result, but also further improves the decoding speed, resulting in an around 5x speedup over autoregressive translation. Our models and codes are available at https://github.com/hemingkx/Generalized-Aggressive-Decoding.
Reasoning is one of the major challenges of Human-like AI and has recently attracted intensive attention from natural language processing (NLP) researchers. However, cross-modal reasoning needs further research. For cross-modal reasoning, we observe that most methods fall into shallow feature matching without in-depth human-like reasoning.The reason lies in that existing cross-modal tasks directly ask questions for a image. However, human reasoning in real scenes is often made under specific background information, a process that is studied by the ABC theory in social psychology. We propose a shared task named "Premise-based Multimodal Reasoning" (PMR), which requires participating models to reason after establishing a profound understanding of background information. We believe that the proposed PMR would contribute to and help shed a light on human-like in-depth reasoning.
In recent years, convolutional neural network (CNN) and other deep learning models have been gradually introduced into the area of gravitational-wave (GW) data processing. Compared with the traditional matched-filtering techniques, CNN has significant advantages in efficiency in GW signal detection tasks. In addition, matched-filtering techniques are based on the template bank of the existing theoretical waveform, which makes it difficult to find GW signals beyond theoretical expectation. In this paper, based on the task of GW detection of binary black holes, we introduce the optimization techniques of deep learning, such as batch normalization and dropout, to CNN models. Detailed studies of model performance are carried out. Through this study, we recommend to use batch normalization and dropout techniques in CNN models in GW signal detection tasks. Furthermore, we investigate the generalization ability of CNN models on different parameter ranges of GW signals. We point out that CNN models are robust to the variation of the parameter range of the GW waveform. This is a major advantage of deep learning models over matched-filtering techniques.