Duyu Tang

One Model, Multiple Tasks: Pathways for Natural Language Understanding

Mar 07, 2022
Duyu Tang, Fan Zhang, Yong Dai, Cong Zhou, Shuangzhi Wu, Shuming Shi

This paper presents a Pathways approach to handle many tasks at once. Our approach is general-purpose and sparse. Unlike prevailing single-purpose models that overspecialize on individual tasks and learn from scratch when extended to new tasks, our approach is general-purpose, with the ability to stitch together existing skills to learn new tasks more effectively. Unlike traditional dense models that always activate all the model parameters, our approach is sparsely activated: only relevant parts of the model (like pathways through the network) are activated. We take natural language understanding as a case study and define a set of skills, such as the skill of understanding the sentiment of text and the skill of understanding natural language questions. These skills can be reused and combined to support many different tasks and situations. We develop our system using Transformer as the backbone. For each skill, we implement skill-specific feed-forward networks, which are activated only if the skill is relevant to the task. An appealing feature of our model is that it not only supports sparsely activated fine-tuning, but also allows us to pretrain skills in the same sparse way with masked language modeling and next sentence prediction. We call this model SkillNet. We have three major findings. First, with only one model checkpoint, SkillNet performs better than task-specific fine-tuning and two multi-task learning baselines (i.e., a dense model and a Mixture-of-Experts model) on six tasks. Second, sparsely activated pre-training further improves the overall performance. Third, SkillNet significantly outperforms baseline systems when extended to new tasks.
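
The sparse activation described above can be illustrated with a toy sketch. Everything here is hypothetical: the skill names, the task-to-skill table, and the stand-in "feed-forward" functions are made up for illustration, and averaging the outputs of the active skills is an assumption, since the abstract does not say how skill outputs are combined.

```python
from typing import Callable, Dict, List, Optional

Vector = List[float]

# Hypothetical stand-ins for skill-specific feed-forward networks.
SKILL_FFNS: Dict[str, Callable[[Vector], Vector]] = {
    "sentiment": lambda h: [2.0 * x for x in h],
    "question": lambda h: [x + 1.0 for x in h],
    "general": lambda h: list(h),
}

# Hypothetical table stating which skills each task activates.
TASK_SKILLS: Dict[str, List[str]] = {
    "sentiment_classification": ["sentiment", "general"],
    "question_answering": ["question", "general"],
}


def sparse_ffn_layer(hidden: Vector, task: str,
                     trace: Optional[List[str]] = None) -> Vector:
    """Run only the skill FFNs relevant to `task`; the others are never called."""
    outputs = []
    for skill in TASK_SKILLS[task]:
        if trace is not None:
            trace.append(skill)  # record which skills were activated
        outputs.append(SKILL_FFNS[skill](hidden))
    # Assumption: combine the active skills by element-wise averaging.
    return [sum(column) / len(outputs) for column in zip(*outputs)]


if __name__ == "__main__":
    activated: List[str] = []
    print(sparse_ffn_layer([1.0, 2.0], "sentiment_classification", activated))
    print(activated)
```

In this sketch the "question" skill is never evaluated for the sentiment task, which is the sense in which only part of the model is active for a given input.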

"Is Whole Word Masking Always Better for Chinese BERT?": Probing on Chinese Grammatical Error Correction

Mar 02, 2022
Yong Dai, Linyang Li, Cong Zhou, Zhangyin Feng, Enbo Zhao, Xipeng Qiu, Piji Li, Duyu Tang

Whole word masking (WWM), which masks all subwords corresponding to a word at once, makes a better English BERT model. For the Chinese language, however, there is no subword because each token is an atomic character. The meaning of a word in Chinese is different in that a word is a compositional unit consisting of multiple characters. Such difference motivates us to investigate whether WWM leads to better context understanding ability for Chinese BERT. To achieve this, we introduce two probing tasks related to grammatical error correction and ask pretrained models to revise or insert tokens in a masked language modeling manner. We construct a dataset including labels for 19,075 tokens in 10,448 sentences. We train three Chinese BERT models with standard character-level masking (CLM), WWM, and a combination of CLM and WWM, respectively. Our major findings are as follows: First, when one character needs to be inserted or replaced, the model trained with CLM performs the best. Second, when more than one character needs to be handled, WWM is the key to better performance. Finally, when being fine-tuned on sentence-level downstream tasks, models trained with different masking strategies perform comparably.
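
To make the difference between the two masking strategies concrete, here is a simplified sketch. It assumes the sentence has already been segmented into words (the segmentation and the example sentence are made up), and it masks a fixed fraction of units rather than following the full BERT masking recipe.

```python
import random
from typing import List

MASK = "[MASK]"


def char_level_mask(words: List[str], ratio: float, rng: random.Random) -> List[str]:
    """CLM: pick individual characters to mask, ignoring word boundaries."""
    chars = [c for w in words for c in w]
    k = max(1, round(len(chars) * ratio))
    picked = set(rng.sample(range(len(chars)), k))
    return [MASK if i in picked else c for i, c in enumerate(chars)]


def whole_word_mask(words: List[str], ratio: float, rng: random.Random) -> List[str]:
    """WWM: pick words to mask, then mask every character of each picked word."""
    k = max(1, round(len(words) * ratio))
    picked = set(rng.sample(range(len(words)), k))
    out: List[str] = []
    for i, w in enumerate(words):
        out.extend([MASK] * len(w) if i in picked else list(w))
    return out


if __name__ == "__main__":
    # Hypothetical pre-segmented sentence: each list item is one Chinese word.
    sentence = ["我", "喜欢", "自然", "语言", "处理"]
    rng = random.Random(0)
    print(char_level_mask(sentence, 0.3, rng))
    print(whole_word_mask(sentence, 0.3, rng))
```

Under CLM a two-character word can end up half masked, so the model may recover the missing character from its neighbor inside the same word; under WWM the whole word must be predicted from the surrounding context.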

* Short paper in Findings of ACL 2022

Exploring and Adapting Chinese GPT to Pinyin Input Method

Mar 02, 2022
Minghuan Tan, Yong Dai, Duyu Tang, Zhangyin Feng, Guoping Huang, Jing Jiang, Jiwei Li, Shuming Shi

While GPT has become the de facto method for text generation tasks, its application to the pinyin input method remains unexplored. In this work, we make the first exploration of leveraging Chinese GPT for the pinyin input method. We find that a frozen GPT achieves state-of-the-art performance on perfect pinyin. However, the performance drops dramatically when the input includes abbreviated pinyin. One reason is that an abbreviated pinyin can be mapped to many perfect pinyin, which in turn map to an even larger number of Chinese characters. We mitigate this issue with two strategies: enriching the context with pinyin and optimizing the training process to help distinguish homophones. To further facilitate the evaluation of the pinyin input method, we create a dataset consisting of 270K instances from 15 domains. Results show that our approach improves performance on abbreviated pinyin across all domains. Model analysis demonstrates that both strategies contribute to the performance boost.
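
The ambiguity of abbreviated pinyin can be seen in a small sketch. The syllable inventory below is a tiny made-up sample, and the sketch assumes an abbreviation is simply the first letter of each syllable, which ignores multi-letter initials such as "zh" or "sh".

```python
from itertools import product
from typing import List

# Hypothetical, heavily truncated inventory of pinyin syllables.
SYLLABLES = ["wo", "wan", "wen", "men", "ma", "mei", "ni"]


def expand(units: List[str]) -> List[str]:
    """List every syllable sequence whose syllables start with the given units."""
    options = [[s for s in SYLLABLES if s.startswith(u)] for u in units]
    return [" ".join(combo) for combo in product(*options)]


if __name__ == "__main__":
    print(expand(["wo", "men"]))  # perfect pinyin: a single reading
    print(expand(["w", "m"]))     # abbreviated pinyin: many readings
```

Even with seven syllables, the abbreviated input "w m" expands to nine candidate readings, and each reading would then map to multiple Chinese characters.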

* To appear in ACL 2022

Pretraining without Wordpieces: Learning Over a Vocabulary of Millions of Words

Feb 24, 2022
Zhangyin Feng, Duyu Tang, Cong Zhou, Junwei Liao, Shuangzhi Wu, Xiaocheng Feng, Bing Qin, Yunbo Cao, Shuming Shi

The standard BERT adopts subword-based tokenization, which may break a word into two or more wordpieces (e.g., converting "lossless" to "loss" and "less"). This raises two questions: (1) what is the best way to obtain the contextual vector of a word that is divided into multiple wordpieces? (2) how can a word be predicted via a cloze test without knowing the number of wordpieces in advance? In this work, we explore the possibility of developing a BERT-style pretrained model over a vocabulary of words instead of wordpieces. We call this word-level BERT model WordBERT. We train models with different vocabulary sizes, initialization configurations and languages. Results show that, compared to standard wordpiece-based BERT, WordBERT makes significant improvements on cloze test and machine reading comprehension. On many other natural language understanding tasks, including POS tagging, chunking and NER, WordBERT consistently performs better than BERT. Model analysis indicates that the major advantage of WordBERT over BERT lies in its understanding of low-frequency and rare words. Furthermore, since the pipeline is language-independent, we train WordBERT for Chinese and obtain significant gains on five natural language understanding datasets. Lastly, an analysis of inference speed shows that WordBERT has a time cost comparable to BERT on natural language understanding tasks.
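
The "lossless" example can be reproduced with a sketch of greedy longest-match-first wordpiece splitting next to a plain word-level lookup. The vocabularies are toy examples, and the "##" continuation prefix follows the BERT convention; this is an illustration of the contrast, not the tokenizer used in the paper.

```python
from typing import List, Set

UNK = "[UNK]"


def wordpiece_tokenize(word: str, vocab: Set[str]) -> List[str]:
    """Greedy longest-match-first split; non-initial pieces carry a '##' prefix."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [UNK]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def word_level_tokenize(word: str, vocab: Set[str]) -> List[str]:
    """Word-level vocabulary: each word is a single token or unknown."""
    return [word] if word in vocab else [UNK]


if __name__ == "__main__":
    piece_vocab = {"loss", "##less", "less"}  # toy wordpiece vocabulary
    word_vocab = {"loss", "less", "lossless"}  # toy word-level vocabulary
    print(wordpiece_tokenize("lossless", piece_vocab))
    print(word_level_tokenize("lossless", word_vocab))
```

With a word-level vocabulary, a cloze prediction for "lossless" is a single token, so the number of mask positions does not have to be guessed in advance.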

CoSQA: 20,000+ Web Queries for Code Search and Question Answering

May 27, 2021
Junjie Huang, Duyu Tang, Linjun Shou, Ming Gong, Ke Xu, Daxin Jiang, Ming Zhou, Nan Duan

Finding code given a natural language query is beneficial to the productivity of software developers. Future progress towards better semantic matching between query and code requires richer supervised training resources. To remedy this, we introduce the CoSQA dataset. It includes 20,604 labels for pairs of natural language queries and code, each annotated by at least 3 human annotators. We further introduce a contrastive learning method dubbed CoCLR to enhance query-code matching, which works as a data augmenter to bring more artificially generated training instances. We show that, evaluated on CodeXGLUE with the same CodeBERT model, training on CoSQA improves the accuracy of code question answering by 5.1%, and incorporating CoCLR brings a further improvement of 10.5%.

* ACL 2021 main conference. The CoSQA data and leaderboard are available at https://github.com/microsoft/CodeXGLUE/tree/main/Text-Code/NL-code-search-WebQuery. The code is available at https://github.com/Jun-jie-Huang/CoCLR

Logic-Driven Context Extension and Data Augmentation for Logical Reasoning of Text

May 08, 2021
Siyuan Wang, Wanjun Zhong, Duyu Tang, Zhongyu Wei, Zhihao Fan, Daxin Jiang, Ming Zhou, Nan Duan

Logical reasoning of text requires understanding critical logical information in the text and performing inference over it. Large-scale pre-trained models for logical reasoning mainly focus on word-level semantics of text while struggling to capture symbolic logic. In this paper, we propose to understand logical symbols and expressions in the text to arrive at the answer. Based on such logical information, we not only put forward a context extension framework but also propose a data augmentation algorithm. The former extends the context to cover implicit logical expressions following logical equivalence laws. The latter augments literally similar but logically different instances to better capture logical information, especially logical negative and conditional relationships. We conduct experiments on the ReClor dataset. The results show that our method achieves state-of-the-art performance, and that both the logic-driven context extension framework and the data augmentation algorithm help improve the accuracy. Our multi-model ensemble system is the first to surpass human performance on both the EASY set and the HARD set of ReClor.
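
As an illustration of extending a context with implicit logical expressions, the sketch below represents each conditional statement as a (premise, conclusion) pair and adds its contrapositive, plus one round of chaining. The representation, the example rules, and the choice of these two laws are assumptions made for illustration; the abstract does not list which laws the framework applies.

```python
from typing import Set, Tuple

Rule = Tuple[str, str]  # (premise, conclusion), read as "premise implies conclusion"


def negate(literal: str) -> str:
    """Toggle a leading negation sign on a literal."""
    return literal[1:] if literal.startswith("¬") else "¬" + literal


def contrapositive(rule: Rule) -> Rule:
    """(p -> q) is logically equivalent to (not q -> not p)."""
    premise, conclusion = rule
    return (negate(conclusion), negate(premise))


def extend_context(rules: Set[Rule]) -> Set[Rule]:
    """Add contrapositives, then one round of chaining (a -> b, b -> c gives a -> c)."""
    extended = set(rules) | {contrapositive(r) for r in rules}
    chained = {(a, d) for (a, b) in extended for (c, d) in extended
               if b == c and a != d}
    return extended | chained


if __name__ == "__main__":
    context = {("rain", "wet"), ("wet", "slippery")}
    for rule in sorted(extend_context(context)):
        print(rule)
```

Starting from two explicit rules, the extended context contains six, including the implicit ("¬slippery", "¬rain").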

* 10 pages, 4 figures

AR-LSAT: Investigating Analytical Reasoning of Text

Apr 15, 2021
Wanjun Zhong, Siyuan Wang, Duyu Tang, Zenan Xu, Daya Guo, Jiahai Wang, Jian Yin, Ming Zhou, Nan Duan

Analytical reasoning is an essential and challenging task that requires a system to analyze a scenario involving a set of particular circumstances and perform reasoning over it to draw conclusions. In this paper, we study the challenge of analytical reasoning of text and introduce a new dataset consisting of questions from the Law School Admission Test from 1991 to 2016. We analyze what knowledge understanding and reasoning abilities are required to do well on this task. Furthermore, to address this reasoning challenge, we design two different baselines: (1) a Transformer-based method which leverages state-of-the-art pre-trained language models and (2) Analytical Reasoning Machine (ARM), a logical-level reasoning framework that extracts symbolic knowledge (e.g., participants, facts, logical functions) to deduce legitimate solutions. In our experiments, we find that the Transformer-based models struggle to solve this task, as their performance is close to random guessing, and that ARM achieves better performance by leveraging symbolic knowledge and interpretable reasoning steps. Results show that both methods still lag far behind human performance, which leaves room for future research.

* 13 pages, 5 figures

WhiteningBERT: An Easy Unsupervised Sentence Embedding Approach

Apr 09, 2021
Junjie Huang, Duyu Tang, Wanjun Zhong, Shuai Lu, Linjun Shou, Ming Gong, Daxin Jiang, Nan Duan

Producing the embedding of a sentence in an unsupervised way is valuable to natural language matching and retrieval problems in practice. In this work, we conduct a thorough examination of unsupervised sentence embeddings based on pretrained models. We study four pretrained models and conduct extensive experiments on seven datasets regarding sentence semantics. We have three main findings. First, averaging all tokens is better than using only the [CLS] vector. Second, combining both top and bottom layers is better than using only top layers. Lastly, an easy whitening-based vector normalization strategy with less than 10 lines of code consistently boosts the performance.

CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation

Feb 09, 2021
Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, Ge Li, Lidong Zhou, Linjun Shou, Long Zhou, Michele Tufano, Ming Gong, Ming Zhou, Nan Duan, Neel Sundaresan, Shao Kun Deng, Shengyu Fu, Shujie Liu

Benchmark datasets have a significant impact on accelerating research in programming language tasks. In this paper, we introduce CodeXGLUE, a benchmark dataset to foster machine learning research for program understanding and generation. CodeXGLUE includes a collection of 10 tasks across 14 datasets and a platform for model evaluation and comparison. CodeXGLUE also features three baseline systems, including the BERT-style, GPT-style, and Encoder-Decoder models, to make it easy for researchers to use the platform. The availability of such data and baselines can help the development and validation of new methods that can be applied to various program understanding and generation problems.

Syntax-Enhanced Pre-trained Model

Dec 28, 2020
Zenan Xu, Daya Guo, Duyu Tang, Qinliang Su, Linjun Shou, Ming Gong, Wanjun Zhong, Xiaojun Quan, Nan Duan, Daxin Jiang

We study the problem of leveraging the syntactic structure of text to enhance pre-trained models such as BERT and RoBERTa. Existing methods utilize syntax of text either in the pre-training stage or in the fine-tuning stage, so that they suffer from discrepancy between the two stages. Such a problem would lead to the necessity of having human-annotated syntactic information, which limits the application of existing methods to broader scenarios. To address this, we present a model that utilizes the syntax of text in both pre-training and fine-tuning stages. Our model is based on Transformer with a syntax-aware attention layer that considers the dependency tree of the text. We further introduce a new pre-training task of predicting the syntactic distance among tokens in the dependency tree. We evaluate the model on three downstream tasks, including relation classification, entity typing, and question answering. Results show that our model achieves state-of-the-art performance on six public benchmark datasets. We have two major findings. First, we demonstrate that infusing automatically produced syntax of text improves pre-trained models. Second, global syntactic distances among tokens bring larger performance gains compared to local head relations between contiguous tokens.
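
One way to picture a syntactic distance between tokens is the number of edges on the path connecting them in the dependency tree. The sketch below computes all pairwise distances under that assumption; the head-index encoding and the example parse are hypothetical, and the paper's exact definition of the pre-training target may differ.

```python
from collections import deque
from typing import List


def syntactic_distances(heads: List[int]) -> List[List[int]]:
    """Pairwise path lengths in a dependency tree.

    heads[i] is the index of token i's head, or -1 for the root.
    """
    n = len(heads)
    neighbors: List[List[int]] = [[] for _ in range(n)]
    for child, head in enumerate(heads):
        if head >= 0:
            neighbors[child].append(head)
            neighbors[head].append(child)

    dist = [[-1] * n for _ in range(n)]
    for source in range(n):
        dist[source][source] = 0
        queue = deque([source])
        while queue:  # breadth-first search over the undirected tree
            u = queue.popleft()
            for v in neighbors[u]:
                if dist[source][v] < 0:
                    dist[source][v] = dist[source][u] + 1
                    queue.append(v)
    return dist


if __name__ == "__main__":
    # Hypothetical parse of "She reads long books": "reads" is the root,
    # "She" and "books" attach to "reads", and "long" attaches to "books".
    heads = [1, -1, 3, 1]
    for row in syntactic_distances(heads):
        print(row)
```

In this example "She" and "long" are three edges apart even though only one word separates them in the sentence, which is the kind of global structure that a head relation between adjacent tokens alone does not capture.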
