Zewen Chi

Language Models are General-Purpose Interfaces

Jun 13, 2022

Yaru Hao, Haoyu Song, Li Dong, Shaohan Huang, Zewen Chi, Wenhui Wang, Shuming Ma, Furu Wei


Abstract:Foundation models have received much attention due to their effectiveness across a broad range of downstream applications. Though there is a big convergence in terms of architecture, most pretrained models are typically still developed for specific tasks or modalities. In this work, we propose to use language models as a general-purpose interface to various foundation models. A collection of pretrained encoders perceive diverse modalities (such as vision, and language), and they dock with a language model that plays the role of a universal task layer. We propose a semi-causal language modeling objective to jointly pretrain the interface and the modular encoders. We subsume the advantages and capabilities from both causal and non-causal modeling, thereby combining the best of two worlds. Specifically, the proposed method not only inherits the capabilities of in-context learning and open-ended generation from causal language modeling, but also is conducive to finetuning because of the bidirectional encoders. More importantly, our approach seamlessly unlocks the combinations of the above capabilities, e.g., enabling in-context learning or instruction following with finetuned encoders. Experimental results across various language-only and vision-language benchmarks show that our model outperforms or is competitive with specialized models on finetuning, zero-shot generalization, and few-shot learning.

* 32 pages. The first three authors contribute equally


On the Representation Collapse of Sparse Mixture of Experts

Apr 20, 2022

Zewen Chi, Li Dong, Shaohan Huang, Damai Dai, Shuming Ma, Barun Patra, Saksham Singhal, Payal Bajaj, Xia Song, Furu Wei


Abstract:Sparse mixture of experts provides larger model capacity while requiring a constant computational overhead. It employs the routing mechanism to distribute input tokens to the best-matched experts according to their hidden representations. However, learning such a routing mechanism encourages token clustering around expert centroids, implying a trend toward representation collapse. In this work, we propose to estimate the routing scores between tokens and experts on a low-dimensional hypersphere. We conduct extensive experiments on cross-lingual language model pre-training and fine-tuning on downstream tasks. Experimental results across seven multilingual benchmarks show that our method achieves consistent gains. We also present a comprehensive analysis on the representation and routing behaviors of our models. Our method alleviates the representation collapse issue and achieves more consistent routing than the baseline mixture-of-experts methods.
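The routing idea in this abstract (scoring tokens against experts on a low-dimensional hypersphere) can be illustrated with a minimal sketch. It is not the authors' implementation: the projection matrix, the temperature parameter, the top-1 selection, and all function names are assumptions made for illustration. The sketch projects a token representation to a lower dimension, L2-normalizes both the projected token and the expert embeddings, and uses their cosine similarity as the routing score.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (place it on the unit hypersphere)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def project(vec, matrix):
    """Linear projection: each row of `matrix` yields one output dimension."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def routing_scores(token, projection, expert_embeddings, temperature=1.0):
    """Cosine similarity between the projected token and each expert.

    `projection` and `temperature` are hypothetical parameters for this sketch.
    """
    low = l2_normalize(project(token, projection))
    return [
        sum(a * b for a, b in zip(low, l2_normalize(expert))) / temperature
        for expert in expert_embeddings
    ]


def route_top1(token, projection, expert_embeddings, temperature=1.0):
    """Index of the best-matching expert (top-1 routing, assumed here)."""
    scores = routing_scores(token, projection, expert_embeddings, temperature)
    return max(range(len(scores)), key=scores.__getitem__)
```

Because both sides are normalized, the score depends only on direction, so rescaling a token's hidden representation does not change which expert it is routed to.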


Cross-Lingual Phrase Retrieval

Apr 19, 2022

Heqi Zheng, Xiao Zhang, Zewen Chi, Heyan Huang, Tan Yan, Tian Lan, Wei Wei, Xian-Ling Mao


Abstract: Cross-lingual retrieval aims to retrieve relevant text across languages. Current methods typically achieve cross-lingual retrieval by learning language-agnostic text representations at the word or sentence level. However, how to learn phrase representations for cross-lingual phrase retrieval is still an open problem. In this paper, we propose XPR, a cross-lingual phrase retriever that extracts phrase representations from unlabeled example sentences. Moreover, we create a large-scale cross-lingual phrase retrieval dataset, which contains 65K bilingual phrase pairs and 4.2M example sentences in 8 English-centric language pairs. Experimental results show that XPR outperforms state-of-the-art baselines that use word-level or sentence-level representations. XPR also shows impressive zero-shot transferability, which enables the model to perform retrieval in a language pair unseen during training. Our dataset, code, and trained models are publicly available at www.github.com/cwszz/XPR/.


Bridging the Gap: Cross-Lingual Summarization with Compression Rate

Oct 15, 2021

Yu Bai, Heyan Huang, Kai Fan, Yang Gao, Zewen Chi, Boxing Chen


Abstract: Cross-lingual Summarization (CLS), which converts a document into a cross-lingual summary, is closely related to the Machine Translation (MT) task. However, MT resources are still underutilized for the CLS task. In this paper, we propose a novel task, Cross-lingual Summarization with Compression rate (CSC), to benefit cross-lingual summarization through a large-scale MT corpus. By introducing the compression rate, we regard the MT task as a special CLS task with a compression rate of 100%. Hence the two can be trained as a unified task, sharing knowledge more effectively. Moreover, to bridge these two tasks smoothly, we propose a simple yet effective data augmentation method to produce document-summary pairs with different compression rates. The proposed method not only improves the performance of the CLS task, but also provides controllability to generate summaries of desired lengths. Experiments demonstrate that our method outperforms various strong baselines.

* Work in progress
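The compression-rate idea in the abstract above can be sketched as follows. The abstract does not say how the rate is measured, so this sketch assumes it is the ratio of target length to source length in tokens; the function names, the `task` argument, and the dictionary layout are hypothetical. MT pairs are assigned a rate of 100% by definition, as the abstract describes, so both kinds of pairs share one example format.

```python
def compression_rate(source_tokens, target_tokens):
    """Assumed definition: target length divided by source length."""
    if not source_tokens:
        raise ValueError("source must contain at least one token")
    return len(target_tokens) / len(source_tokens)


def make_example(source_tokens, target_tokens, task):
    """Build a unified training example for either task.

    task: "mt" for a translation pair, "cls" for a document-summary pair.
    """
    if task == "mt":
        rate = 1.0  # MT is treated as CLS with a 100% compression rate
    elif task == "cls":
        rate = compression_rate(source_tokens, target_tokens)
    else:
        raise ValueError(f"unknown task: {task}")
    return {"source": source_tokens, "target": target_tokens, "rate": rate}
```

With a rate attached to every pair, a model could in principle be conditioned on it at generation time to control summary length.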


Cross-Lingual Language Model Meta-Pretraining

Sep 23, 2021

Zewen Chi, Heyan Huang, Luyang Liu, Yu Bai, Xian-Ling Mao


Abstract:The success of pretrained cross-lingual language models relies on two essential abilities, i.e., generalization ability for learning downstream tasks in a source language, and cross-lingual transferability for transferring the task knowledge to other languages. However, current methods jointly learn the two abilities in a single-phase cross-lingual pretraining process, resulting in a trade-off between generalization and cross-lingual transfer. In this paper, we propose cross-lingual language model meta-pretraining, which learns the two abilities in different training phases. Our method introduces an additional meta-pretraining phase before cross-lingual pretraining, where the model learns generalization ability on a large-scale monolingual corpus. Then, the model focuses on learning cross-lingual transfer on a multilingual corpus. Experimental results show that our method improves both generalization and cross-lingual transfer, and produces better-aligned representations across different languages.


XLM-E: Cross-lingual Language Model Pre-training via ELECTRA

Jun 30, 2021

Zewen Chi, Shaohan Huang, Li Dong, Shuming Ma, Saksham Singhal, Payal Bajaj, Xia Song, Furu Wei


Abstract:In this paper, we introduce ELECTRA-style tasks to cross-lingual language model pre-training. Specifically, we present two pre-training tasks, namely multilingual replaced token detection, and translation replaced token detection. Besides, we pretrain the model, named as XLM-E, on both multilingual and parallel corpora. Our model outperforms the baseline models on various cross-lingual understanding tasks with much less computation cost. Moreover, analysis shows that XLM-E tends to obtain better cross-lingual transferability.
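As a rough illustration of the replaced token detection target named in this abstract, the sketch below builds the per-token binary labels a discriminator would be trained to predict. In ELECTRA-style training the replacements come from a generator model; here they are supplied by hand. Treating the translation variant as the same labeling applied to a concatenated translation pair is an assumption of this sketch, as are the function name and the `</s>` separator.

```python
def rtd_labels(original_tokens, corrupted_tokens):
    """Return 1 where a token was replaced and 0 where it is unchanged."""
    if len(original_tokens) != len(corrupted_tokens):
        raise ValueError("sequences must have the same length")
    return [int(o != c) for o, c in zip(original_tokens, corrupted_tokens)]


# Multilingual case: a single monolingual sentence.
print(rtd_labels(["the", "cat", "sat"], ["the", "dog", "sat"]))

# Translation case (assumed input format): a concatenated translation pair.
original = ["the", "cat", "</s>", "le", "chat"]
corrupted = ["the", "cat", "</s>", "le", "chien"]
print(rtd_labels(original, corrupted))
```

Every position contributes a label, which is one reason ELECTRA-style objectives are often described as sample-efficient compared with predicting only masked positions.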


Consistency Regularization for Cross-Lingual Fine-Tuning

Jun 15, 2021

Bo Zheng, Li Dong, Shaohan Huang, Wenhui Wang, Zewen Chi, Saksham Singhal, Wanxiang Che, Ting Liu, Xia Song, Furu Wei


Abstract:Fine-tuning pre-trained cross-lingual language models can transfer task-specific supervision from one language to the others. In this work, we propose to improve cross-lingual fine-tuning with consistency regularization. Specifically, we use example consistency regularization to penalize the prediction sensitivity to four types of data augmentations, i.e., subword sampling, Gaussian noise, code-switch substitution, and machine translation. In addition, we employ model consistency to regularize the models trained with two augmented versions of the same training set. Experimental results on the XTREME benchmark show that our method significantly improves cross-lingual fine-tuning across various tasks, including text classification, question answering, and sequence labeling.

* ACL-2021
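The example consistency term described in the abstract above penalizes how much a model's prediction changes when an input is augmented. A minimal sketch, assuming a symmetric KL divergence between the two predicted distributions (the abstract does not name the divergence, and the function names are hypothetical):

```python
import math


def softmax(logits):
    """Convert logits to probabilities, shifting by the max for stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def example_consistency(logits_original, logits_augmented):
    """Symmetric KL between predictions on an example and its augmentation.

    The choice of symmetric KL is an assumption of this sketch.
    """
    p = softmax(logits_original)
    q = softmax(logits_augmented)
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))
```

In training, a term like this would be added to the task loss with a weighting coefficient, with the augmented logits coming from the same model run on, for example, a code-switched or machine-translated version of the input.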


Improving Pretrained Cross-Lingual Language Models via Self-Labeled Word Alignment

Jun 11, 2021

Zewen Chi, Li Dong, Bo Zheng, Shaohan Huang, Xian-Ling Mao, Heyan Huang, Furu Wei


Abstract:The cross-lingual language models are typically pretrained with masked language modeling on multilingual text or parallel sentences. In this paper, we introduce denoising word alignment as a new cross-lingual pre-training task. Specifically, the model first self-labels word alignments for parallel sentences. Then we randomly mask tokens in a bitext pair. Given a masked token, the model uses a pointer network to predict the aligned token in the other language. We alternately perform the above two steps in an expectation-maximization manner. Experimental results show that our method improves cross-lingual transferability on various datasets, especially on the token-level tasks, such as question answering, and structured prediction. Moreover, the model can serve as a pretrained word aligner, which achieves reasonably low error rates on the alignment benchmarks. The code and pretrained parameters are available at https://github.com/CZWin32768/XLM-Align.

* ACL-2021


mT6: Multilingual Pretrained Text-to-Text Transformer with Translation Pairs

Apr 18, 2021

Zewen Chi, Li Dong, Shuming Ma, Shaohan Huang, Xian-Ling Mao, Heyan Huang, Furu Wei


Abstract:Multilingual T5 (mT5) pretrains a sequence-to-sequence model on massive monolingual texts, which has shown promising results on many cross-lingual tasks. In this paper, we improve multilingual text-to-text transfer Transformer with translation pairs (mT6). Specifically, we explore three cross-lingual text-to-text pre-training tasks, namely, machine translation, translation pair span corruption, and translation span corruption. In addition, we propose a partially non-autoregressive objective for text-to-text pre-training. We evaluate the methods on seven multilingual benchmark datasets, including sentence classification, named entity recognition, question answering, and abstractive summarization. Experimental results show that the proposed mT6 improves cross-lingual transferability over mT5.


A Robust and Domain-Adaptive Approach for Low-Resource Named Entity Recognition

Jan 02, 2021

Houjin Yu, Xian-Ling Mao, Zewen Chi, Wei Wei, Heyan Huang


Abstract: Building reliable named entity recognition (NER) systems from limited annotated data has recently attracted much attention. Nearly all existing works rely heavily on domain-specific resources, such as external lexicons and knowledge bases. However, such domain-specific resources are often unavailable, and they are difficult and expensive to construct, which has become a key obstacle to wider adoption. To tackle this problem, we propose RDANER, a novel robust and domain-adaptive approach for low-resource NER that uses only cheap and easily obtainable resources. Extensive experiments on three benchmark datasets demonstrate that our approach achieves the best performance when using only cheap and easily obtainable resources, and delivers competitive results against state-of-the-art methods that use hard-to-obtain domain-specific resources. All our code and corpora can be found at https://github.com/houking-can/RDANER.

* 2020 IEEE International Conference on Knowledge Graph (ICKG) (pp. 297-304)
* Best Student Paper of 2020 IEEE International Conference on Knowledge Graph (ICKG)
