Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Furu Wei

Document AI: Benchmarks, Models and Applications

Nov 16, 2021
Lei Cui, Yiheng Xu, Tengchao Lv, Furu Wei

Figure 1 for Document AI: Benchmarks, Models and Applications

Figure 2 for Document AI: Benchmarks, Models and Applications

Figure 3 for Document AI: Benchmarks, Models and Applications

Figure 4 for Document AI: Benchmarks, Models and Applications

Document AI, or Document Intelligence, is a relatively new research topic that refers to the techniques for automatically reading, understanding, and analyzing business documents. It is an important research direction for natural language processing and computer vision. In recent years, the popularity of deep learning technology has greatly advanced the development of Document AI, such as document layout analysis, visual information extraction, document visual question answering, document image classification, etc. This paper briefly reviews some of the representative models, tasks, and benchmark datasets. Furthermore, we also introduce early-stage heuristic rule-based document analysis, statistical machine learning algorithms, and deep learning approaches especially pre-training methods. Finally, we look into future directions for Document AI research.

Via

Access Paper or Ask Questions

VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts

Nov 03, 2021
Wenhui Wang, Hangbo Bao, Li Dong, Furu Wei

Figure 1 for VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts

Figure 2 for VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts

Figure 3 for VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts

Figure 4 for VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts

We present a unified Vision-Language pretrained Model (VLMo) that jointly learns a dual encoder and a fusion encoder with a modular Transformer network. Specifically, we introduce Mixture-of-Modality-Experts (MoME) Transformer, where each block contains a pool of modality-specific experts and a shared self-attention layer. Because of the modeling flexibility of MoME, pretrained VLMo can be fine-tuned as a fusion encoder for vision-language classification tasks, or used as a dual encoder for efficient image-text retrieval. Moreover, we propose a stagewise pre-training strategy, which effectively leverages large-scale image-only and text-only data besides image-text pairs. Experimental results show that VLMo achieves state-of-the-art results on various vision-language tasks, including VQA and NLVR2. The code and pretrained models are available at https://aka.ms/vlmo.

* Work in progress

Via

Access Paper or Ask Questions

Multilingual Machine Translation Systems from Microsoft for WMT21 Shared Task

Nov 03, 2021
Jian Yang, Shuming Ma, Haoyang Huang, Dongdong Zhang, Li Dong, Shaohan Huang, Alexandre Muzio, Saksham Singhal, Hany Hassan Awadalla, Xia Song, Furu Wei

Figure 1 for Multilingual Machine Translation Systems from Microsoft for WMT21 Shared Task

Figure 2 for Multilingual Machine Translation Systems from Microsoft for WMT21 Shared Task

Figure 3 for Multilingual Machine Translation Systems from Microsoft for WMT21 Shared Task

This report describes Microsoft's machine translation systems for the WMT21 shared task on large-scale multilingual machine translation. We participated in all three evaluation tracks including Large Track and two Small Tracks where the former one is unconstrained and the latter two are fully constrained. Our model submissions to the shared task were initialized with DeltaLM\footnote{\url{https://aka.ms/deltalm}}, a generic pre-trained multilingual encoder-decoder model, and fine-tuned correspondingly with the vast collected parallel data and allowed data sources according to track settings, together with applying progressive learning and iterative back-translation approaches to further improve the performance. Our final submissions ranked first on three tracks in terms of the automatic evaluation metric.

* WMT21

Via

Access Paper or Ask Questions

WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing

Oct 29, 2021
Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei

Figure 1 for WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing

Figure 2 for WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing

Figure 3 for WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing

Figure 4 for WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing

Self-supervised learning (SSL) achieves great success in speech recognition, while limited exploration has been attempted for other speech processing tasks. As speech signal contains multi-faceted information including speaker identity, paralinguistics, spoken content, etc., learning universal representations for all speech tasks is challenging. In this paper, we propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks. WavLM is built based on the HuBERT framework, with an emphasis on both spoken content modeling and speaker identity preservation. We first equip the Transformer structure with gated relative position bias to improve its capability on recognition tasks. For better speaker discrimination, we propose an utterance mixing training strategy, where additional overlapped utterances are created unsupervisely and incorporated during model training. Lastly, we scale up the training dataset from 60k hours to 94k hours. WavLM Large achieves state-of-the-art performance on the SUPERB benchmark, and brings significant improvements for various speech processing tasks on their representative benchmarks. The code and pretrained models are available at https://aka.ms/wavlm.

Via

Access Paper or Ask Questions

Separating Long-Form Speech with Group-Wise Permutation Invariant Training

Oct 27, 2021
Wangyou Zhang, Zhuo Chen, Naoyuki Kanda, Shujie Liu, Jinyu Li, Sefik Emre Eskimez, Takuya Yoshioka, Xiong Xiao, Zhong Meng, Yanmin Qian, Furu Wei

Figure 1 for Separating Long-Form Speech with Group-Wise Permutation Invariant Training

Figure 2 for Separating Long-Form Speech with Group-Wise Permutation Invariant Training

Figure 3 for Separating Long-Form Speech with Group-Wise Permutation Invariant Training

Figure 4 for Separating Long-Form Speech with Group-Wise Permutation Invariant Training

Multi-talker conversational speech processing has drawn many interests for various applications such as meeting transcription. Speech separation is often required to handle overlapped speech that is commonly observed in conversation. Although the existing utterancelevel permutation invariant training-based continuous speech separation approach has proven to be effective in various conditions, it lacks the ability to leverage the long-span relationship of utterances and is computationally inefficient due to the highly overlapped sliding windows. To overcome these drawbacks, we propose a novel training scheme named Group-PIT, which allows direct training of the speech separation models on the long-form speech with a low computational cost for label assignment. Two different speech separation approaches with Group-PIT are explored, including direct long-span speech separation and short-span speech separation with long-span tracking. The experiments on the simulated meeting-style data demonstrate the effectiveness of our proposed approaches, especially in dealing with a very long speech input.

* 5 pages, 3 figures, 3 tables, submitted to IEEE ICASSP 2022

Via

Access Paper or Ask Questions

s2s-ft: Fine-Tuning Pretrained Transformer Encoders for Sequence-to-Sequence Learning

Oct 26, 2021
Hangbo Bao, Li Dong, Wenhui Wang, Nan Yang, Furu Wei

Figure 1 for s2s-ft: Fine-Tuning Pretrained Transformer Encoders for Sequence-to-Sequence Learning

Figure 2 for s2s-ft: Fine-Tuning Pretrained Transformer Encoders for Sequence-to-Sequence Learning

Figure 3 for s2s-ft: Fine-Tuning Pretrained Transformer Encoders for Sequence-to-Sequence Learning

Figure 4 for s2s-ft: Fine-Tuning Pretrained Transformer Encoders for Sequence-to-Sequence Learning

Pretrained bidirectional Transformers, such as BERT, have achieved significant improvements in a wide variety of language understanding tasks, while it is not straightforward to directly apply them for natural language generation. In this paper, we present a sequence-to-sequence fine-tuning toolkit s2s-ft, which adopts pretrained Transformers for conditional generation tasks. Inspired by UniLM, we implement three sequence-to-sequence fine-tuning algorithms, namely, causal fine-tuning, masked fine-tuning, and pseudo-masked fine-tuning. By leveraging the existing pretrained bidirectional Transformers, experimental results show that s2s-ft achieves strong performance on several benchmarks of abstractive summarization, and question generation. Moreover, we demonstrate that the package s2s-ft supports both monolingual and multilingual NLG tasks. The s2s-ft toolkit is available at https://github.com/microsoft/unilm/tree/master/s2s-ft.

* Demo paper for the s2s-ft toolkit: https://github.com/microsoft/unilm/tree/master/s2s-ft

Via

Access Paper or Ask Questions

Improving Non-autoregressive Generation with Mixup Training

Oct 21, 2021
Ting Jiang, Shaohan Huang, Zihan Zhang, Deqing Wang, Fuzhen Zhuang, Furu Wei, Haizhen Huang, Liangjie Zhang, Qi Zhang

Figure 1 for Improving Non-autoregressive Generation with Mixup Training

Figure 2 for Improving Non-autoregressive Generation with Mixup Training

Figure 3 for Improving Non-autoregressive Generation with Mixup Training

Figure 4 for Improving Non-autoregressive Generation with Mixup Training

While pre-trained language models have achieved great success on various natural language understanding tasks, how to effectively leverage them into non-autoregressive generation tasks remains a challenge. To solve this problem, we present a non-autoregressive generation model based on pre-trained transformer models. To bridge the gap between autoregressive and non-autoregressive models, we propose a simple and effective iterative training method called MIx Source and pseudo Target (MIST). Unlike other iterative decoding methods, which sacrifice the inference speed to achieve better performance based on multiple decoding iterations, MIST works in the training stage and has no effect on inference time. Our experiments on three generation benchmarks including question generation, summarization and paraphrase generation, show that the proposed framework achieves the new state-of-the-art results for fully non-autoregressive models. We also demonstrate that our method can be used to a variety of pre-trained models. For instance, MIST based on the small pre-trained model also obtains comparable performance with seq2seq models.

Via

Access Paper or Ask Questions

Towards Making the Most of Multilingual Pretraining for Zero-Shot Neural Machine Translation

Oct 16, 2021
Guanhua Chen, Shuming Ma, Yun Chen, Dongdong Zhang, Jia Pan, Wenping Wang, Furu Wei

Figure 1 for Towards Making the Most of Multilingual Pretraining for Zero-Shot Neural Machine Translation

Figure 2 for Towards Making the Most of Multilingual Pretraining for Zero-Shot Neural Machine Translation

Figure 3 for Towards Making the Most of Multilingual Pretraining for Zero-Shot Neural Machine Translation

Figure 4 for Towards Making the Most of Multilingual Pretraining for Zero-Shot Neural Machine Translation

This paper demonstrates that multilingual pretraining, a proper fine-tuning method and a large-scale parallel dataset from multiple auxiliary languages are all critical for zero-shot translation, where the NMT model is tested on source languages unseen during supervised training. Following this idea, we present SixT++, a strong many-to-English NMT model that supports 100 source languages but is trained once with a parallel dataset from only six source languages. SixT++ initializes the decoder embedding and the full encoder with XLM-R large, and then trains the encoder and decoder layers with a simple two-stage training strategy. SixT++ achieves impressive performance on many-to-English translation. It significantly outperforms CRISS and m2m-100, two strong multilingual NMT systems, with an average gain of 7.2 and 5.0 BLEU respectively. Additionally, SixT++ offers a set of model parameters that can be further fine-tuned to develop unsupervised NMT models for low-resource languages. With back-translation on monolingual data of low-resource language, it outperforms all current state-of-the-art unsupervised methods on Nepali and Sinhal for both translating into and from English.

* Preprint

Via

Access Paper or Ask Questions

MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding

Oct 16, 2021
Junlong Li, Yiheng Xu, Lei Cui, Furu Wei

Figure 1 for MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding

Figure 2 for MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding

Figure 3 for MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding

Figure 4 for MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding

Multimodal pre-training with text, layout, and image has made significant progress for Visually-rich Document Understanding (VrDU), especially the fixed-layout documents such as scanned document images. While, there are still a large number of digital documents where the layout information is not fixed and needs to be interactively and dynamically rendered for visualization, making existing layout-based pre-training approaches not easy to apply. In this paper, we propose MarkupLM for document understanding tasks with markup languages as the backbone such as HTML/XML-based documents, where text and markup information is jointly pre-trained. Experiment results show that the pre-trained MarkupLM significantly outperforms the existing strong baseline models on several document understanding tasks. The pre-trained model and code will be publicly available at https://aka.ms/markuplm.

* Work in Progress

Via

Access Paper or Ask Questions

SpeechT5: Unified-Modal Encoder-Decoder Pre-training for Spoken Language Processing

Oct 14, 2021
Junyi Ao, Rui Wang, Long Zhou, Shujie Liu, Shuo Ren, Yu Wu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei

Figure 1 for SpeechT5: Unified-Modal Encoder-Decoder Pre-training for Spoken Language Processing

Figure 2 for SpeechT5: Unified-Modal Encoder-Decoder Pre-training for Spoken Language Processing

Figure 3 for SpeechT5: Unified-Modal Encoder-Decoder Pre-training for Spoken Language Processing

Figure 4 for SpeechT5: Unified-Modal Encoder-Decoder Pre-training for Spoken Language Processing

Motivated by the success of T5 (Text-To-Text Transfer Transformer) in pre-training natural language processing models, we propose a unified-modal SpeechT5 framework that explores the encoder-decoder pre-training for self-supervised speech/text representation learning. The SpeechT5 framework consists of a shared encoder-decoder network and six modal-specific (speech/text) pre/post-nets. After preprocessing the speech/text input through the pre-nets, the shared encoder-decoder network models the sequence to sequence transformation, and then the post-nets generate the output in the speech/text modality based on the decoder output. Particularly, SpeechT5 can pre-train on a large scale of unlabeled speech and text data to improve the capability of the speech and textual modeling. To align the textual and speech information into a unified semantic space, we propose a cross-modal vector quantization method with random mixing-up to bridge speech and text. Extensive evaluations on a wide variety of spoken language processing tasks, including voice conversion, automatic speech recognition, text to speech, and speaker identification, show the superiority of the proposed SpeechT5 framework.

* work in process

Via

Access Paper or Ask Questions