Ming Zhou

Inferential Text Generation with Multiple Knowledge Sources and Meta-Learning

Apr 07, 2020
Daya Guo, Akari Asai, Duyu Tang, Nan Duan, Ming Gong, Linjun Shou, Daxin Jiang, Jian Yin, Ming Zhou

We study the problem of generating inferential texts of events for a variety of commonsense relations, such as if-else relations. Existing approaches typically use limited evidence from training examples and learn each relation individually. In this work, we use multiple knowledge sources to fuel the model. Existing commonsense knowledge bases like ConceptNet are dominated by taxonomic knowledge (e.g., isA and relatedTo relations) and contain only a limited amount of inferential knowledge. We use not only structured commonsense knowledge bases, but also natural language snippets from search-engine results. These sources are incorporated into a generative base model via a key-value memory network. In addition, we introduce a meta-learning based multi-task learning algorithm. For each targeted commonsense relation, we regard the learning of examples from other relations as the meta-training process, and the evaluation on examples from the targeted relation as the meta-test process. We conduct experiments on the Event2Mind and ATOMIC datasets. Results show that both the integration of multiple knowledge sources and the use of the meta-learning algorithm improve the performance.
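
The meta-training / meta-test split described in the abstract can be pictured as a leave-one-relation-out partition. The following is only a minimal sketch of that partition, not the authors' training algorithm; the function name, the relation names, and the toy examples are illustrative assumptions.

```python
def leave_one_relation_out(examples_by_relation, target):
    """Split examples so the target relation is held out as the meta-test set.

    All other relations form the meta-training set (hypothetical helper,
    not taken from the paper's code).
    """
    meta_train = {
        relation: examples
        for relation, examples in examples_by_relation.items()
        if relation != target
    }
    meta_test = examples_by_relation[target]
    return meta_train, meta_test


# Toy data: (event, inference) pairs grouped by relation (illustrative only).
data = {
    "xIntent": [("PersonX drinks coffee", "to stay awake")],
    "xReact": [("PersonX drinks coffee", "alert")],
    "xNeed": [("PersonX drinks coffee", "to brew coffee")],
}

meta_train, meta_test = leave_one_relation_out(data, "xReact")
```

Repeating the split once per relation gives each targeted relation its own meta-test set while the remaining relations drive meta-training.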

At Which Level Should We Extract? An Empirical Study on Extractive Document Summarization

Apr 06, 2020
Qingyu Zhou, Furu Wei, Ming Zhou

Extractive methods have proven to be very effective in automatic document summarization. Previous works perform this task by identifying informative contents at the sentence level. However, it is unclear whether performing extraction at the sentence level is the best solution. In this work, we show that unnecessity and redundancy issues exist when extracting full sentences, and that extracting sub-sentential units is a promising alternative. Specifically, we propose extracting sub-sentential units on the corresponding constituency parsing tree. We present a neural extractive model that leverages sub-sentential information and extracts these units. Extensive experiments and analyses show that extracting sub-sentential units performs competitively compared to full sentence extraction under both automatic and human evaluations. Hopefully, our work could provide some inspiration on the basic extraction units in extractive summarization for future research.

Learning to Summarize Passages: Mining Passage-Summary Pairs from Wikipedia Revision Histories

Apr 06, 2020
Qingyu Zhou, Furu Wei, Ming Zhou

In this paper, we propose a method for automatically constructing a passage-to-summary dataset by mining Wikipedia page revision histories. In particular, the method mines the main body passages and the introduction sentences that are added to the pages simultaneously. The constructed dataset contains more than one hundred thousand passage-summary pairs. The quality analysis suggests that the dataset is promising as a training and validation set for passage summarization. We validate and analyze the performance of various summarization systems on the proposed dataset. The dataset will be available online at https://res.qyzhou.me.

STEP: Sequence-to-Sequence Transformer Pre-training for Document Summarization

Apr 04, 2020
Yanyan Zou, Xingxing Zhang, Wei Lu, Furu Wei, Ming Zhou

Abstractive summarization aims to rewrite a long document to its shorter form, which is usually modeled as a sequence-to-sequence (Seq2Seq) learning problem. Seq2Seq Transformers are powerful models for this problem. Unfortunately, training large Seq2Seq Transformers on limited supervised summarization data is challenging. We, therefore, propose STEP (as shorthand for Sequence-to-Sequence Transformer Pre-training), which can be trained on large scale unlabeled documents. Specifically, STEP is pre-trained using three different tasks, namely sentence reordering, next sentence generation, and masked document generation. Experiments on two summarization datasets show that all three tasks can improve performance upon a heavily tuned large Seq2Seq Transformer which already includes a strong pre-trained encoder by a large margin. By using our best task to pre-train STEP, we outperform the best published abstractive model on CNN/DailyMail by 0.8 ROUGE-2 and New York Times by 2.4 ROUGE-2.
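
Of the three pre-training tasks named above, sentence reordering is the easiest to illustrate. The sketch below shows one plausible way to build a training pair from an unlabeled document, assuming the model reads the shuffled sentences and is trained to emit them in their original order; the function name and the seeding scheme are assumptions, not details from the paper.

```python
import random


def sentence_reordering_example(sentences, seed=0):
    """Build a (source, target) pair for a sentence-reordering task.

    source: the document's sentences in shuffled order
    target: the same sentences in their original order
    (hypothetical helper; a fixed seed keeps the sketch reproducible)
    """
    rng = random.Random(seed)
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled, list(sentences)


document = [
    "The storm reached the coast on Monday.",
    "Thousands of residents were evacuated.",
    "Power was restored two days later.",
]

source, target = sentence_reordering_example(document, seed=13)
```

Because the pair is derived from the document alone, no human-written summary is needed, which is what makes the task usable on large unlabeled corpora.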

XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation

Apr 03, 2020
Yaobo Liang, Nan Duan, Yeyun Gong, Ning Wu, Fenfei Guo, Weizhen Qi, Ming Gong, Linjun Shou, Daxin Jiang, Guihong Cao, Xiaodong Fan, Bruce Zhang, Rahul Agrawal, Edward Cui, Sining Wei, Taroon Bharti, Jiun-Hung Chen, Winnie Wu, Shuguang Liu, Fan Yang, Ming Zhou

In this paper, we introduce XGLUE, a new benchmark dataset to train large-scale cross-lingual pre-trained models using multilingual and bilingual corpora, and evaluate their performance across a diverse set of cross-lingual tasks. Compared to GLUE (Wang et al., 2019), which is labeled in English and includes natural language understanding tasks only, XGLUE has three main advantages: (1) it provides two corpora with different sizes for cross-lingual pre-training; (2) it provides 11 diversified tasks that cover both natural language understanding and generation scenarios; (3) for each task, it provides labeled data in multiple languages. We extend a recent cross-lingual pre-trained model Unicoder (Huang et al., 2019) to cover both understanding and generation tasks, which is evaluated on XGLUE as a strong baseline. We also evaluate the base versions (12-layer) of Multilingual BERT, XLM and XLM-R for comparison.

XGPT: Cross-modal Generative Pre-Training for Image Captioning

Mar 04, 2020
Qiaolin Xia, Haoyang Huang, Nan Duan, Dongdong Zhang, Lei Ji, Zhifang Sui, Edward Cui, Taroon Bharti, Xin Liu, Ming Zhou

While many BERT-based cross-modal pre-trained models produce excellent results on downstream understanding tasks like image-text retrieval and VQA, they cannot be applied to generation tasks directly. In this paper, we propose XGPT, a new method of Cross-modal Generative Pre-Training for Image Captioning that is designed to pre-train text-to-image caption generators through three novel generation tasks, including Image-conditioned Masked Language Modeling (IMLM), Image-conditioned Denoising Autoencoding (IDA), and Text-conditioned Image Feature Generation (TIFG). As a result, the pre-trained XGPT can be fine-tuned without any task-specific architecture modifications to create state-of-the-art models for image captioning. Experiments show that XGPT obtains new state-of-the-art results on the benchmark datasets, including COCO Captions and Flickr30k Captions. We also use XGPT to generate new image captions as data augmentation for the image retrieval task and achieve significant improvement on all recall metrics.

* 12 pages, 3 figures, 7 tables

UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training

Feb 28, 2020
Hangbo Bao, Li Dong, Furu Wei, Wenhui Wang, Nan Yang, Xiaodong Liu, Yu Wang, Songhao Piao, Jianfeng Gao, Ming Zhou, Hsiao-Wuen Hon

We propose to pre-train a unified language model for both autoencoding and partially autoregressive language modeling tasks using a novel training procedure, referred to as a pseudo-masked language model (PMLM). Given an input text with masked tokens, we rely on conventional masks to learn inter-relations between corrupted tokens and context via autoencoding, and pseudo masks to learn intra-relations between masked spans via partially autoregressive modeling. With well-designed position embeddings and self-attention masks, the context encodings are reused to avoid redundant computation. Moreover, conventional masks used for autoencoding provide global masking information, so that all the position embeddings are accessible in partially autoregressive language modeling. In addition, the two tasks pre-train a unified language model as a bidirectional encoder and a sequence-to-sequence decoder, respectively. Our experiments show that the unified language models pre-trained using PMLM achieve new state-of-the-art results on a wide range of natural language understanding and generation tasks across several widely used benchmarks.

* 11 pages

MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers

Feb 25, 2020
Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, Ming Zhou

Pre-trained language models (e.g., BERT (Devlin et al., 2018) and its variants) have achieved remarkable success in a variety of NLP tasks. However, these models usually consist of hundreds of millions of parameters, which brings challenges for fine-tuning and online serving in real-life applications due to latency and capacity constraints. In this work, we present a simple and effective approach to compress large Transformer (Vaswani et al., 2017) based pre-trained models, termed deep self-attention distillation. The small model (student) is trained by deeply mimicking the self-attention module, which plays a vital role in Transformer networks, of the large model (teacher). Specifically, we propose distilling the self-attention module of the last Transformer layer of the teacher, which is effective and flexible for the student. Furthermore, we introduce the scaled dot-product between values in the self-attention module as the new deep self-attention knowledge, in addition to the attention distributions (i.e., the scaled dot-product of queries and keys) that have been used in existing works. Moreover, we show that introducing a teacher assistant (Mirzadeh et al., 2019) also helps the distillation of large pre-trained Transformer models. Experimental results demonstrate that our model outperforms state-of-the-art baselines across student models of different parameter sizes. In particular, it retains more than 99% accuracy on SQuAD 2.0 and several GLUE benchmark tasks using 50% of the Transformer parameters and computations of the teacher model. The code and models are publicly available at https://github.com/microsoft/unilm/tree/master/minilm

* Code and models: https://github.com/microsoft/unilm/tree/master/minilm
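
The two kinds of self-attention knowledge mentioned in the abstract (attention distributions from queries and keys, and the scaled dot-product between values) can be compared between teacher and student with a KL divergence. The following is a minimal single-head sketch in pure Python under that assumption; it is not the released implementation, and the helper names and toy vectors are illustrative.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def relation(a, b):
    """Row-wise softmax of the scaled dot-product a @ b^T (seq_len x seq_len)."""
    scale = math.sqrt(len(a[0]))
    return [
        softmax([sum(x * y for x, y in zip(ra, rb)) / scale for rb in b])
        for ra in a
    ]


def mean_kl(teacher, student):
    """Average KL(teacher_row || student_row) over sequence positions."""
    total = 0.0
    for t_row, s_row in zip(teacher, student):
        total += sum(t * math.log(t / s) for t, s in zip(t_row, s_row) if t > 0)
    return total / len(teacher)


def distillation_loss(t_q, t_k, t_v, s_q, s_k, s_v):
    """Attention-distribution term plus value-relation term (assumed equal weights)."""
    attention_term = mean_kl(relation(t_q, t_k), relation(s_q, s_k))
    value_term = mean_kl(relation(t_v, t_v), relation(s_v, s_v))
    return attention_term + value_term


# Toy last-layer vectors: 3 tokens, teacher head size 4, student head size 2.
teacher = [[1.0, 0.0, 0.5, 0.2], [0.3, 0.8, 0.1, 0.0], [0.0, 0.2, 0.9, 0.4]]
student = [[0.5, 0.1], [0.2, 0.7], [0.9, 0.3]]

loss = distillation_loss(teacher, teacher, teacher, student, student, student)
```

Both relation matrices have shape sequence length by sequence length, so the teacher and student may use different hidden sizes, which is one way to read the abstract's remark that the approach is flexible for the student.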

ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training

Feb 22, 2020
Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, Ming Zhou

In this paper, we present a new sequence-to-sequence pre-training model called ProphetNet, which introduces a novel self-supervised objective named future n-gram prediction and the proposed n-stream self-attention mechanism. Instead of optimizing one-step-ahead prediction as in traditional sequence-to-sequence models, ProphetNet is optimized by n-step-ahead prediction, which predicts the next n tokens simultaneously based on previous context tokens at each time step. The future n-gram prediction explicitly encourages the model to plan for the future tokens and prevents overfitting on strong local correlations. We pre-train ProphetNet using a base scale dataset (16GB) and a large scale dataset (160GB) respectively. Then we conduct experiments on CNN/DailyMail, Gigaword, and SQuAD 1.1 benchmarks for abstractive summarization and question generation tasks. Experimental results show that ProphetNet achieves new state-of-the-art results on all these datasets compared to the models using the same scale pre-training corpus.
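
To make the n-step-ahead objective concrete, the sketch below builds the prediction targets for each time step: given the prefix before step t, the targets are the next n tokens. It covers only target construction, not the n-stream self-attention mechanism; the function name and the padding token are assumptions.

```python
def future_ngram_targets(tokens, n, pad="<pad>"):
    """For each step t, return the n tokens that follow the prefix tokens[:t].

    Steps near the end of the sequence are padded so every target has
    length n (the padding convention is an assumption of this sketch).
    """
    targets = []
    for t in range(len(tokens)):
        window = tokens[t:t + n]
        window += [pad] * (n - len(window))
        targets.append(window)
    return targets


sequence = ["the", "cat", "sat", "down"]
bigram_targets = future_ngram_targets(sequence, n=2)
# bigram_targets[0] is ["the", "cat"]; the final step is padded: ["down", "<pad>"]
```

With n = 1 this reduces to the usual one-step-ahead targets, which is the traditional setting the abstract contrasts against.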

CodeBERT: A Pre-Trained Model for Programming and Natural Languages

Feb 19, 2020
Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, Ming Zhou

We present CodeBERT, a bimodal pre-trained model for programming language (PL) and natural language (NL). CodeBERT learns general-purpose representations that support downstream NL-PL applications such as natural language code search, code documentation generation, etc. We develop CodeBERT with a Transformer-based neural architecture, and train it with a hybrid objective function that incorporates the pre-training task of replaced token detection, which is to detect plausible alternatives sampled from generators. This enables us to utilize both bimodal data of NL-PL pairs and unimodal data, where the former provides input tokens for model training while the latter helps to learn better generators. We evaluate CodeBERT on two NL-PL applications by fine-tuning model parameters. Results show that CodeBERT achieves state-of-the-art performance on both natural language code search and code documentation generation tasks. Furthermore, to investigate what type of knowledge is learned in CodeBERT, we construct a dataset for NL-PL probing, and evaluate in a zero-shot setting where parameters of pre-trained models are fixed. Results show that CodeBERT performs better than previous pre-trained models on NL-PL probing.

* 10 pages
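
Replaced token detection, as described in the abstract, asks a model to tell original tokens apart from plausible alternatives sampled by a generator. The sketch below shows how one training example and its per-token labels might be assembled; the toy generator is a stand-in lookup table rather than a learned model, and all names are hypothetical.

```python
def rtd_example(tokens, masked_positions, generator):
    """Corrupt the masked positions with generator samples and label every token.

    Label 1 means the token matches the original, 0 means it was replaced.
    A generator sample that happens to equal the original token therefore
    counts as original (an assumption of this sketch).
    """
    corrupted = list(tokens)
    for i in masked_positions:
        corrupted[i] = generator(tokens, i)
    labels = [int(c == o) for c, o in zip(corrupted, tokens)]
    return corrupted, labels


def toy_generator(tokens, position):
    """Stand-in for a learned generator: a fixed table of plausible swaps."""
    alternatives = {"max": "min", "return": "return"}
    return alternatives.get(tokens[position], tokens[position])


code_tokens = ["return", "max", "(", "a", ",", "b", ")"]
corrupted, labels = rtd_example(code_tokens, masked_positions=[0, 1], generator=toy_generator)
```

Every position receives a label, not only the masked ones, so the detector gets a training signal from the whole sequence.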
