Transformer-based pre-training models like BERT have achieved remarkable performance in many natural language processing tasks. However, these models are expensive in both computation and memory, hindering their deployment on resource-constrained devices. In this work, we propose TernaryBERT, which ternarizes the weights in a fine-tuned BERT model. Specifically, we use both approximation-based and loss-aware ternarization methods and empirically investigate the ternarization granularity of different parts of BERT. Moreover, to reduce the accuracy degradation caused by the lower capacity of low bits, we leverage the knowledge distillation technique during training. Experiments on the GLUE benchmark and SQuAD show that our proposed TernaryBERT outperforms other BERT quantization methods and even achieves performance comparable to the full-precision model while being 14.9x smaller.
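As a rough illustration of approximation-based ternarization, the sketch below applies a TWN-style rule that maps each weight to {-alpha, 0, +alpha}; the 0.7 threshold and the row-wise granularity are assumptions for illustration, not necessarily the settings used in TernaryBERT.

```python
import torch

def ternarize_row_wise(w: torch.Tensor) -> torch.Tensor:
    """Map each row of a weight matrix to values in {-alpha, 0, +alpha}."""
    delta = 0.7 * w.abs().mean(dim=1, keepdim=True)    # per-row threshold (assumed heuristic)
    mask = (w.abs() > delta).float()                   # which weights stay non-zero
    alpha = (w.abs() * mask).sum(dim=1, keepdim=True) / mask.sum(dim=1, keepdim=True).clamp(min=1)
    return alpha * torch.sign(w) * mask

w = torch.randn(768, 768)             # e.g. one attention projection matrix in BERT
print(ternarize_row_wise(w)[:2, :8])  # each row now uses at most three distinct values
```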
As computing power grows, neural machine translation (NMT) models grow accordingly and become more accurate. However, they also become harder to deploy on edge devices due to memory constraints. To cope with this problem, a common practice is to distill knowledge from a large, accurately trained teacher network (T) into a compact student network (S). Although knowledge distillation (KD) is useful in most cases, our study shows that existing KD techniques might not be suitable for deep NMT engines, so we propose a novel alternative. In our model, besides matching the predictions of T and S, we use a combinatorial mechanism to inject layer-level supervision from T into S. In this paper, we target low-resource settings and evaluate our translation engines on the Portuguese--English, Turkish--English, and English--German directions. Students trained with our technique have 50% fewer parameters and still deliver results comparable to those of 12-layer teachers.
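As a simplified stand-in for the layer-level supervision described above, the sketch below combines the usual soft-label loss with a loss that matches each student layer to a learned convex combination of teacher layers; the softmax weighting and MSE matching are illustrative assumptions rather than the paper's exact mechanism.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits,
            student_layers, teacher_layers, mix_logits, T=2.0):
    # (1) match predictions: KL between softened teacher and student distributions
    pred = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * T * T
    # (2) layer-level supervision: student layer i vs. a weighted sum of teacher layers
    layer = 0.0
    teacher_stack = torch.stack(teacher_layers, dim=0)       # [L_T, batch, len, dim]
    for i, s in enumerate(student_layers):
        w = F.softmax(mix_logits[i], dim=0)                  # weights over teacher layers
        target = (w.view(-1, 1, 1, 1) * teacher_stack).sum(0)
        layer = layer + F.mse_loss(s, target)
    return pred + layer

# toy shapes: 6-layer student, 12-layer teacher, hidden size 512, vocab 1000
B, N, D, V = 2, 7, 512, 1000
student_layers = [torch.randn(B, N, D) for _ in range(6)]
teacher_layers = [torch.randn(B, N, D) for _ in range(12)]
mix_logits = torch.nn.Parameter(torch.zeros(6, 12))          # learned combination weights
loss = kd_loss(torch.randn(B, V), torch.randn(B, V),
               student_layers, teacher_layers, mix_logits)
print(loss.item())
```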
The Transformer has been widely used in many Natural Language Processing (NLP) tasks, and the scaled dot-product attention between tokens is its core module. This attention is a token-wise design whose complexity is quadratic in the sequence length, limiting its applicability to long-sequence tasks. In this paper, we propose a dimension-wise attention mechanism, based on which a novel language modeling approach (namely TensorCoder) can be developed. The dimension-wise attention reduces the attention complexity from the original $O(N^2d)$ to $O(Nd^2)$, where $N$ is the length of the sequence and $d$ is the dimensionality of each head. We verify TensorCoder on two tasks: masked language modeling and neural machine translation. Compared with the original Transformer, TensorCoder not only greatly reduces the computation of the original model but also achieves improved performance on masked language modeling (on the PTB dataset) and comparable performance on machine translation tasks.
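A toy contrast between token-wise and dimension-wise attention costs is sketched below; the dimension-wise form (a d-by-d score matrix computed across feature dimensions) is one plausible reading of the abstract, not TensorCoder's exact formulation, and the sqrt(N) scaling is an arbitrary choice.

```python
import torch
import torch.nn.functional as F

N, d = 1024, 64                                 # sequence length, head dimensionality
q = torch.randn(N, d)
k = torch.randn(N, d)
v = torch.randn(N, d)

# token-wise: N x N score matrix
tok_scores = q @ k.t() / d ** 0.5               # O(N^2 d) multiply-adds
tok_out = F.softmax(tok_scores, dim=-1) @ v     # [N, d]

# dimension-wise: d x d score matrix
dim_scores = q.t() @ k / N ** 0.5               # O(N d^2) multiply-adds
dim_out = v @ F.softmax(dim_scores, dim=-1)     # [N, d]

print(tok_scores.shape, dim_scores.shape)       # (1024, 1024) vs. (64, 64)
print(N * N * d, N * d * d)                     # 67108864 vs. 4194304 score multiply-adds
```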
The field of machine translation has progressed tremendously in recent years. Even though translation quality has improved significantly, current systems are still unable to produce uniformly acceptable machine translations for the variety of possible use cases. In this work, we place machine translation in a cross-lingual pipeline and introduce downstream tasks to define task-specific acceptability of machine translations. This allows us to leverage parallel data to automatically generate acceptability annotations on a large scale, which in turn help to learn acceptability detectors for the downstream tasks. We conduct experiments demonstrating the effectiveness of our framework for a range of downstream tasks and translation models.
Despite its original goal of jointly learning to align and translate, prior research suggests that the state-of-the-art neural machine translation model, the Transformer, captures poor word alignment through its attention mechanism. In this paper, we show that attention weights do capture accurate word alignment, which can be revealed only if we choose the correct decoding step and layer from which to induce it. We propose to induce alignment with the to-be-aligned target token as the decoder input, and we present two simple but effective interpretation methods for word alignment induction, based either on the attention weights or on leave-one-out measures. In contrast to previous studies, we find that attention weights capture better word alignment than the leave-one-out measures under our setting. Using the proposed method with attention weights, we greatly improve over fast-align on word alignment induction. Finally, we present a multi-task learning framework for training the Transformer model and show that by incorporating GIZA++ alignments into our multi-task training, we can induce significantly better alignments than GIZA++.
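A minimal sketch of the attention-weight variant is given below: at the decoding step whose input is the to-be-aligned target token, the source position with the largest cross-attention weight is taken as its alignment. Picking a single fixed layer and using a plain argmax are simplifications for illustration.

```python
import torch

def align_from_attention(cross_attn: torch.Tensor, layer: int) -> list[tuple[int, int]]:
    """cross_attn: [layers, tgt_len, src_len], averaged over heads.
    Decoding step t reads target token t as input, so its weights align token t."""
    weights = cross_attn[layer]                       # [tgt_len, src_len]
    src_idx = weights.argmax(dim=-1)                  # best source position per target token
    return [(int(s), t) for t, s in enumerate(src_idx)]

# toy example: 6 decoder layers, 4 target tokens, 5 source tokens
attn = torch.rand(6, 4, 5)
print(align_from_attention(attn, layer=2))            # e.g. [(3, 0), (1, 1), (4, 2), (0, 3)]
```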
By introducing a small set of additional parameters, a probe learns to solve specific linguistic tasks (e.g., dependency parsing) in a supervised manner using feature representations (e.g., contextualized embeddings). The effectiveness of such probing tasks is taken as evidence that the pre-trained model encodes linguistic knowledge. However, this approach to evaluating a language model is undermined by uncertainty about how much knowledge is learned by the probe itself. Complementary to those works, we propose a parameter-free probing technique for analyzing pre-trained language models (e.g., BERT). Our method requires no direct supervision from the probing tasks, nor do we introduce additional parameters to the probing process. Our experiments on BERT show that syntactic trees recovered from BERT using our method are significantly better than linguistically uninformed baselines. We further feed the empirically induced dependency structures into a downstream sentiment classification task and find the resulting improvement comparable to, or even superior to, that of a human-designed dependency schema.
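One way such a parameter-free probe could be realized is sketched below, under the assumption of a masking-based impact measure: mask token j, measure how much the representation of token i changes, and decode a tree from the resulting impact matrix. The stub encoder, the impact measure, and the greedy tree decoding are all illustrative assumptions, not necessarily the paper's procedure.

```python
import numpy as np

def encode(tokens, masked=()):
    """Stub contextual encoder: returns one vector per token.
    A real probe would call a pre-trained encoder like BERT with these positions masked."""
    rng = np.random.default_rng(hash((tuple(tokens), tuple(sorted(masked)))) % 2**32)
    return rng.standard_normal((len(tokens), 16))

def impact_matrix(tokens):
    """f[i, j] = how much masking token j perturbs the representation of token i."""
    n = len(tokens)
    f = np.zeros((n, n))
    for i in range(n):
        base = encode(tokens, masked=(i,))[i]
        for j in range(n):
            if i == j:
                continue
            perturbed = encode(tokens, masked=(i, j))[i]
            f[i, j] = np.linalg.norm(base - perturbed)
    return f

def greedy_tree(f):
    """Prim-style decoding: repeatedly attach the most-impacted remaining token."""
    n = len(f)
    in_tree, edges = {0}, []
    while len(in_tree) < n:
        i, j = max(((i, j) for i in in_tree for j in range(n) if j not in in_tree),
                   key=lambda e: f[e[0], e[1]] + f[e[1], e[0]])
        edges.append((i, j))
        in_tree.add(j)
    return edges

tokens = "the quick brown fox jumps".split()
print(greedy_tree(impact_matrix(tokens)))             # undirected tree over token indices
```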
Masked language models and autoregressive language models are two types of language models. While pre-trained masked language models such as BERT dominate natural language understanding (NLU) tasks, autoregressive language models such as GPT are especially capable at natural language generation (NLG). In this paper, we propose a probabilistic masking scheme for the masked language model, which we call the probabilistically masked language model (PMLM). We implement a specific PMLM with a uniform prior distribution on the masking ratio, named u-PMLM. We prove that u-PMLM is equivalent to an autoregressive permutated language model. One main advantage of the model is that it supports text generation in arbitrary order with surprisingly good quality, which could potentially enable new applications beyond traditional unidirectional generation. In addition, the pre-trained u-PMLM outperforms BERT on a set of downstream NLU tasks.
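The masking scheme itself is simple to sketch: instead of masking a fixed 15% of tokens as in BERT, each training sequence draws its masking ratio from a uniform prior. The token ids and mask id below are placeholders.

```python
import torch

MASK_ID = 103                                   # placeholder mask-token id

def u_pmlm_mask(input_ids: torch.Tensor):
    """Mask a Uniform(0, 1) fraction of positions in each sequence."""
    batch, length = input_ids.shape
    ratio = torch.rand(batch, 1)                # one masking ratio per sequence
    mask = torch.rand(batch, length) < ratio    # Bernoulli(ratio) per position
    labels = torch.where(mask, input_ids, torch.full_like(input_ids, -100))   # -100 = ignored
    masked = torch.where(mask, torch.full_like(input_ids, MASK_ID), input_ids)
    return masked, labels

ids = torch.randint(1000, 30000, (4, 12))
masked, labels = u_pmlm_mask(ids)
print((masked == MASK_ID).float().mean(dim=1))  # per-sequence masking ratios vary widely
```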
Pre-trained language models like BERT and RoBERTa, though powerful in many natural language processing tasks, are expensive in both computation and memory. To alleviate this problem, one approach is to compress them for specific tasks before deployment. However, recent works on BERT compression usually reduce the large BERT model to a fixed smaller size and cannot fully satisfy the requirements of edge devices with varying hardware capabilities. In this paper, we propose a novel dynamic BERT model (abbreviated as DynaBERT), which can run at adaptive width and depth. The training process of DynaBERT first trains a width-adaptive BERT and then allows both adaptive width and depth, by distilling knowledge from the full-sized model to small sub-networks. Network rewiring is also used so that the more important attention heads and neurons are shared by more sub-networks. Comprehensive experiments under various efficiency constraints demonstrate that our proposed dynamic BERT (or RoBERTa) at its largest size has performance comparable to BERT (or RoBERTa), while at smaller widths and depths it consistently outperforms existing BERT compression methods.
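A rough illustration of running a layer at adaptive width is sketched below: only the first fraction of FFN neurons (or, analogously, attention heads) is used, assuming rewiring has already placed the most important units first. This is a sketch of the width-slicing idea, not DynaBERT's implementation; adaptive depth would similarly drop whole layers.

```python
import torch
import torch.nn as nn

class SliceableFFN(nn.Module):
    """Feed-forward block whose inner width can be sliced at inference time."""
    def __init__(self, hidden=768, inner=3072):
        super().__init__()
        self.fc1 = nn.Linear(hidden, inner)
        self.fc2 = nn.Linear(inner, hidden)

    def forward(self, x, width_mult=1.0):
        keep = int(self.fc1.out_features * width_mult)       # how many neurons to keep
        h = torch.relu(x @ self.fc1.weight[:keep].t() + self.fc1.bias[:keep])
        return h @ self.fc2.weight[:, :keep].t() + self.fc2.bias

ffn = SliceableFFN()
x = torch.randn(2, 16, 768)
for w in (1.0, 0.75, 0.5, 0.25):                             # widths chosen per device
    print(w, ffn(x, width_mult=w).shape)                     # output stays [2, 16, 768]
```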