Jianfeng Gao

Domain-Specific Pretraining for Vertical Search: Case Study on Biomedical Literature

Jun 25, 2021

Yu Wang, Jinchao Li, Tristan Naumann, Chenyan Xiong, Hao Cheng, Robert Tinn, Cliff Wong, Naoto Usuyama, Richard Rogahn, Zhihong Shen(+5 more)

Abstract:Information overload is a prevalent challenge in many high-value domains. A prominent case in point is the explosion of the biomedical literature on COVID-19, which swelled to hundreds of thousands of papers in a matter of months. In general, biomedical literature expands by two papers every minute, totalling over a million new papers every year. Search in the biomedical realm, and in many other vertical domains, is challenging due to the scarcity of direct supervision from click logs. Self-supervised learning has emerged as a promising direction to overcome the annotation bottleneck. We propose a general approach for vertical search based on domain-specific pretraining and present a case study for the biomedical domain. Despite being substantially simpler and not using any relevance labels for training or development, our method performs comparably to or better than the best systems in the official TREC-COVID evaluation, a COVID-related biomedical search competition. Using distributed computing in modern cloud infrastructure, our system can scale to tens of millions of articles on PubMed and has been deployed as Microsoft Biomedical Search, a new search experience for biomedical literature: https://aka.ms/biomedsearch.

Efficient Self-supervised Vision Transformers for Representation Learning

Jun 17, 2021

Chunyuan Li, Jianwei Yang, Pengchuan Zhang, Mei Gao, Bin Xiao, Xiyang Dai, Lu Yuan, Jianfeng Gao

Abstract:This paper investigates two techniques for developing efficient self-supervised vision transformers (EsViT) for visual representation learning. First, we show through a comprehensive empirical study that multi-stage architectures with sparse self-attentions can significantly reduce modeling complexity, but at the cost of losing the ability to capture fine-grained correspondences between image regions. Second, we propose a new pre-training task of region matching, which allows the model to capture fine-grained region dependencies and as a result significantly improves the quality of the learned vision representations. Our results show that, by combining the two techniques, EsViT achieves 81.3% top-1 on the ImageNet linear probe evaluation, outperforming prior art with around an order of magnitude higher throughput. When transferring to downstream linear classification tasks, EsViT outperforms its supervised counterpart on 17 out of 18 datasets. The code and models will be publicly available.

* 24 pages, 12 figures, file size 13.6MB

XtremeDistilTransformers: Task Transfer for Task-agnostic Distillation

Jun 12, 2021

Subhabrata Mukherjee, Ahmed Hassan Awadallah, Jianfeng Gao

Abstract:While deep and large pre-trained models are the state-of-the-art for various natural language processing tasks, their huge size poses significant challenges for practical uses in resource-constrained settings. Recent works in knowledge distillation propose task-agnostic as well as task-specific methods to compress these models, with task-specific ones often yielding a higher compression rate. In this work, we develop a new task-agnostic distillation framework, XtremeDistilTransformers, that leverages the advantage of task-specific methods for learning a small universal model that can be applied to arbitrary tasks and languages. To this end, we study the transferability of several source tasks, augmentation resources and model architectures for distillation. We evaluate our model performance on multiple tasks, including the General Language Understanding Evaluation (GLUE) benchmark, the SQuAD question answering dataset and a massive multi-lingual NER dataset with 41 languages. We release three distilled task-agnostic checkpoints with 13M, 22M and 33M parameters, obtaining SOTA performance in several tasks.

* Code and checkpoints released (links in draft)

Joint Retrieval and Generation Training for Grounded Text Generation

Jun 03, 2021

Yizhe Zhang, Siqi Sun, Xiang Gao, Yuwei Fang, Chris Brockett, Michel Galley, Jianfeng Gao, Bill Dolan

Abstract:Recent advances in large-scale pre-training such as GPT-3 allow seemingly high quality text to be generated from a given prompt. However, such generation systems often suffer from problems of hallucinated facts, and are not inherently designed to incorporate useful external information. Grounded generation models appear to offer remedies, but their training typically relies on rarely-available parallel data where corresponding information-relevant documents are provided for context. We propose a framework that alleviates this data constraint by jointly training a grounded generator and document retriever on the language model signal. The model learns to reward retrieval of the documents with the highest utility in generation, and attentively combines them using a Mixture-of-Experts (MoE) ensemble to generate follow-on text. We demonstrate that both generator and retriever can take advantage of this joint training and work synergistically to produce more informative and relevant text in both prose and dialogue generation.

Enriching Transformers with Structured Tensor-Product Representations for Abstractive Summarization

Jun 02, 2021

Yichen Jiang, Asli Celikyilmaz, Paul Smolensky, Paul Soulos, Sudha Rao, Hamid Palangi, Roland Fernandez, Caitlin Smith, Mohit Bansal, Jianfeng Gao

Abstract:Abstractive summarization, the task of generating a concise summary of input documents, requires: (1) reasoning over the source document to determine the salient pieces of information scattered across the long document, and (2) composing a cohesive text by reconstructing these salient facts into a shorter summary that faithfully reflects the complex relations connecting these facts. In this paper, we adapt TP-TRANSFORMER (Schlag et al., 2019), an architecture that enriches the original Transformer (Vaswani et al., 2017) with the explicitly compositional Tensor Product Representation (TPR), for the task of abstractive summarization. The key feature of our model is a structural bias that we introduce by encoding two separate representations for each token to represent the syntactic structure (with role vectors) and semantic content (with filler vectors) separately. The model then binds the role and filler vectors into the TPR as the layer output. We argue that the structured intermediate representations enable the model to take better control of the contents (salient facts) and structures (the syntax that connects the facts) when generating the summary. Empirically, we show that our TP-TRANSFORMER outperforms the Transformer and the original TP-TRANSFORMER significantly on several abstractive summarization datasets based on both automatic and human evaluations. On several syntactic and semantic probing tasks, we demonstrate the emergent structural information in the role vectors and improved syntactic interpretability in the TPR layer outputs. Code and models are available at https://github.com/jiangycTarheel/TPT-Summ.

* NAACL 2021 (14 pages)

Compositional Processing Emerges in Neural Networks Solving Math Problems

May 19, 2021

Jacob Russin, Roland Fernandez, Hamid Palangi, Eric Rosen, Nebojsa Jojic, Paul Smolensky, Jianfeng Gao

Abstract:A longstanding question in cognitive science concerns the learning mechanisms underlying compositionality in human cognition. Humans can infer the structured relationships (e.g., grammatical rules) implicit in their sensory observations (e.g., auditory speech), and use this knowledge to guide the composition of simpler meanings into complex wholes. Recent progress in artificial neural networks has shown that when large models are trained on enough linguistic data, grammatical structure emerges in their representations. We extend this work to the domain of mathematical reasoning, where it is possible to formulate precise hypotheses about how meanings (e.g., the quantities corresponding to numerals) should be composed according to structured rules (e.g., order of operations). Our work shows that neural networks are not only able to infer something about the structured relationships implicit in their training data, but can also deploy this knowledge to guide the composition of individual meanings into composite wholes.

* 7 pages, 2 figures, Accepted to CogSci 2021 for poster presentation

Targeted Adversarial Training for Natural Language Understanding

Apr 12, 2021

Lis Pereira, Xiaodong Liu, Hao Cheng, Hoifung Poon, Jianfeng Gao, Ichiro Kobayashi

Abstract:We present a simple yet effective Targeted Adversarial Training (TAT) algorithm to improve adversarial training for natural language understanding. The key idea is to introspect current mistakes and prioritize adversarial training steps to where the model errs the most. Experiments show that TAT can significantly improve accuracy over standard adversarial training on GLUE and attain new state-of-the-art zero-shot results on XNLI. Our code will be released at: https://github.com/namisan/mt-dnn.

* 9 pages, 4 tables, 3 figures, NAACL 2021

Adversarial Training as Stackelberg Game: An Unrolled Optimization Approach

Apr 11, 2021

Simiao Zuo, Chen Liang, Haoming Jiang, Xiaodong Liu, Pengcheng He, Jianfeng Gao, Weizhu Chen, Tuo Zhao

Abstract:Adversarial training has been shown to improve the generalization performance of deep learning models in various natural language processing tasks. Existing works usually formulate adversarial training as a zero-sum game, which is solved by alternating gradient descent/ascent algorithms. Such a formulation treats the adversarial and the defending players equally, which is undesirable because only the defending player contributes to the generalization performance. To address this issue, we propose Stackelberg Adversarial Training (SALT), which formulates adversarial training as a Stackelberg game. This formulation induces a competition between a leader and a follower, where the follower generates perturbations, and the leader trains the model subject to the perturbations. Different from conventional adversarial training, in SALT, the leader is in an advantageous position. When the leader moves, it recognizes the strategy of the follower and takes the anticipated follower's outcomes into consideration. Such a leader's advantage enables us to improve the model fitting to the unperturbed data. The leader's strategic information is captured by the Stackelberg gradient, which is obtained using an unrolling algorithm. Our experimental results on a set of machine translation and natural language understanding tasks show that SALT outperforms existing adversarial training baselines across all tasks.

Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding

Mar 29, 2021

Pengchuan Zhang, Xiyang Dai, Jianwei Yang, Bin Xiao, Lu Yuan, Lei Zhang, Jianfeng Gao

Abstract:This paper presents a new Vision Transformer (ViT) architecture, Multi-Scale Vision Longformer, which significantly enhances the ViT of Dosovitskiy et al. (2020) for encoding high-resolution images using two techniques. The first is the multi-scale model structure, which provides image encodings at multiple scales with manageable computational cost. The second is the attention mechanism of Vision Longformer, which is a variant of Longformer (Beltagy et al., 2020), originally developed for natural language processing, and achieves a linear complexity w.r.t. the number of input tokens. A comprehensive empirical study shows that the new ViT significantly outperforms several strong baselines, including the existing ViT models and their ResNet counterparts, and the Pyramid Vision Transformer from a concurrent work (Wang et al., 2021), on a range of vision tasks, including image classification, object detection, and segmentation. The models and source code used in this study will be released to the public soon.
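The linear-complexity claim can be illustrated with a simple count of query-key pairs. The sketch below is not the paper's attention mechanism: it assumes a one-dimensional token sequence and a fixed local window, and the function names and the `window` parameter are hypothetical. It only shows why restricting each token to a fixed-size neighbourhood makes the cost grow linearly with the number of tokens, while full self-attention grows quadratically.

```python
def full_attention_pairs(n: int) -> int:
    """Query-key pairs scored by full self-attention over n tokens."""
    return n * n


def local_attention_pairs(n: int, window: int) -> int:
    """Query-key pairs when each token attends only to tokens within
    `window` positions on either side (1D simplification, an assumption)."""
    total = 0
    for i in range(n):
        lo = max(0, i - window)        # clip the window at the left edge
        hi = min(n - 1, i + window)    # clip the window at the right edge
        total += hi - lo + 1
    return total


if __name__ == "__main__":
    for n in (1024, 2048, 4096):
        print(n, full_attention_pairs(n), local_attention_pairs(n, window=8))
```

Doubling `n` roughly doubles the local count and quadruples the full count, since the local count is bounded by `n * (2 * window + 1)`.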

Token-wise Curriculum Learning for Neural Machine Translation

Mar 20, 2021

Chen Liang, Haoming Jiang, Xiaodong Liu, Pengcheng He, Weizhu Chen, Jianfeng Gao, Tuo Zhao

Abstract:Existing curriculum learning approaches to Neural Machine Translation (NMT) require sampling sufficient amounts of "easy" samples from training data at the early training stage. This is not always achievable for low-resource languages where the amount of training data is limited. To address this limitation, we propose a novel token-wise curriculum learning approach that creates sufficient amounts of easy samples. Specifically, the model learns to predict a short sub-sequence from the beginning part of each target sentence at the early stage of training, and then the sub-sequence is gradually expanded as the training progresses. Such a new curriculum design is inspired by the cumulative effect of translation errors, which makes the later tokens more difficult to predict than the beginning ones. Extensive experiments show that our approach can consistently outperform baselines on 5 language pairs, especially for low-resource languages. Combining our approach with sentence-level methods further improves the performance on high-resource languages.
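The prefix-expansion idea in this abstract can be sketched as a loss mask over target positions. This is a minimal illustration, not the authors' implementation: the linear schedule, the `start_frac` parameter, and both function names are assumptions introduced here, and the abstract does not specify how the sub-sequence length is actually chosen.

```python
from typing import List


def prefix_length(step: int, total_steps: int, target_len: int,
                  start_frac: float = 0.2) -> int:
    """Number of leading target tokens that contribute to the loss at `step`.

    Hypothetical linear schedule: begin with `start_frac` of the sentence
    and reach the full sentence at `total_steps`.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    frac = start_frac + (1.0 - start_frac) * progress
    # Keep at least one token so every sentence yields a training signal.
    return max(1, min(target_len, round(frac * target_len)))


def loss_mask(step: int, total_steps: int, target_len: int) -> List[int]:
    """1 for target positions inside the current prefix, 0 for the rest."""
    k = prefix_length(step, total_steps, target_len)
    return [1] * k + [0] * (target_len - k)


if __name__ == "__main__":
    for step in (0, 50, 100):
        print(step, loss_mask(step, total_steps=100, target_len=10))
```

In a training loop, a mask like this would multiply the per-token loss so that only the prefix positions contribute early on, with later positions added as the step count increases.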
