Jin Cao

Alexa Teacher Model: Pretraining and Distilling Multi-Billion-Parameter Encoders for Natural Language Understanding Systems

Jun 15, 2022
Jack FitzGerald, Shankar Ananthakrishnan, Konstantine Arkoudas, Davide Bernardi, Abhishek Bhagia, Claudio Delli Bovi, Jin Cao, Rakesh Chada, Amit Chauhan, Luoxin Chen, Anurag Dwarakanath, Satyam Dwivedi, Turan Gojayev, Karthik Gopalakrishnan, Thomas Gueudre, Dilek Hakkani-Tur, Wael Hamza, Jonathan Hueser, Kevin Martin Jose, Haidar Khan, Beiye Liu, Jianhua Lu, Alessandro Manzotti, Pradeep Natarajan, Karolina Owczarzak, Gokmen Oz, Enrico Palumbo, Charith Peris, Chandana Satya Prakash, Stephen Rawls, Andy Rosenbaum, Anjali Shenoy, Saleh Soltan, Mukund Harakere Sridhar, Liz Tan, Fabian Triefenbach, Pan Wei, Haiyang Yu, Shuai Zheng, Gokhan Tur, Prem Natarajan

We present results from a large-scale experiment on pretraining encoders with non-embedding parameter counts ranging from 700M to 9.3B, their subsequent distillation into smaller models ranging from 17M to 170M parameters, and their application to the Natural Language Understanding (NLU) component of a virtual assistant system. Though we train using 70% spoken-form data, our teacher models perform comparably to XLM-R and mT5 when evaluated on the written-form Cross-lingual Natural Language Inference (XNLI) corpus. We perform a second stage of pretraining on our teacher models using in-domain data from our system, improving error rates by 3.86% relative for intent classification and 7.01% relative for slot filling. We find that even a 170M-parameter model distilled from our Stage 2 teacher model has 2.88% better intent classification and 7.69% better slot filling error rates when compared to the 2.3B-parameter teacher trained only on public data (Stage 1), emphasizing the importance of in-domain data for pretraining. When evaluated offline using labeled NLU data, our 17M-parameter Stage 2 distilled model outperforms both XLM-R Base (85M params) and DistilBERT (42M params) by 4.23% and 6.14%, respectively. Finally, we present results from a full virtual assistant experimentation platform, where we find that models trained using our pretraining and distillation pipeline outperform models distilled from 85M-parameter teachers by 3.74%-4.91% on an automatic measurement of full-system user dissatisfaction.
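
The abstract reports its gains as relative error-rate improvements. As a minimal sketch of that arithmetic, the snippet below computes a relative reduction from two error rates. The function name and the error rates are hypothetical, chosen only for illustration; they are not figures from the paper.

```python
def relative_error_reduction(baseline_error: float, new_error: float) -> float:
    """Return the relative reduction in error rate, as a percentage."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - new_error) / baseline_error


# Hypothetical error rates (not taken from the paper).
baseline = 0.100   # 10.0% of examples misclassified by the baseline
improved = 0.095   # 9.5% misclassified by the improved model

# The absolute gain is 0.5 percentage points; the relative gain is larger
# because it is measured against the baseline error rate.
print(f"{relative_error_reduction(baseline, improved):.2f}% relative")
```

A relative figure depends on the baseline it is measured against, so the same absolute gain looks larger when the baseline error rate is already low.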

* Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '22), August 14-18, 2022, Washington, DC, USA
* KDD 2022

Instilling Type Knowledge in Language Models via Multi-Task QA

Apr 28, 2022
Shuyang Li, Mukund Sridhar, Chandana Satya Prakash, Jin Cao, Wael Hamza, Julian McAuley

Understanding human language often necessitates understanding entities and their place in a taxonomy of knowledge -- their types. Previous methods to learn entity types rely on training classifiers on datasets with coarse, noisy, and incomplete labels. We introduce a method to instill fine-grained type knowledge in language models with text-to-text pre-training on type-centric questions leveraging knowledge base documents and knowledge graphs. We create the WikiWiki dataset: entities and passages from 10M Wikipedia articles linked to the Wikidata knowledge graph with 41K types. Models trained on WikiWiki achieve state-of-the-art performance in zero-shot dialog state tracking benchmarks, accurately infer entity types in Wikipedia articles, and can discover new types deemed useful by human judges.

* Findings of NAACL 2022; dataset link: https://github.com/amazon-research/wikiwiki-dataset

DAME: Domain Adaptation for Matching Entities

Apr 20, 2022
Mohamed Trabelsi, Jeff Heflin, Jin Cao

Entity matching (EM) identifies data records that refer to the same real-world entity. Despite the effort in the past years to improve the performance in EM, the existing methods still require a huge amount of labeled data in each domain during the training phase. These methods treat each domain individually, and capture the specific signals for each dataset in EM, and this leads to overfitting on just one dataset. The knowledge that is learned from one dataset is not utilized to better understand the EM task in order to make predictions on the unseen datasets with fewer labeled samples. In this paper, we propose a new domain adaptation-based method that transfers the task knowledge from multiple source domains to a target domain. Our method presents a new setting for EM where the objective is to capture the task-specific knowledge from pretraining our model using multiple source domains, then testing our model on a target domain. We study the zero-shot learning case on the target domain, and demonstrate that our method learns the EM task and transfers knowledge to the target domain. We extensively study fine-tuning our model on the target dataset from multiple domains, and demonstrate that our model generalizes better than state-of-the-art methods in EM.

* Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining 2022

Recurrent Neural Network from Adder's Perspective: Carry-lookahead RNN

Jun 22, 2021
Haowei Jiang, Feiwei Qin, Jin Cao, Yong Peng, Yanli Shao

The recurrent network architecture is widely used in sequence modeling, but its serial dependency hinders parallel computation, which makes it inefficient. Serial adders encountered the same problem in the early days of digital electronics. In this paper, we discuss the similarities between the recurrent neural network (RNN) and the serial adder. Inspired by the carry-lookahead adder, we introduce a carry-lookahead module to the RNN, which makes it possible for the RNN to run in parallel. We then design a method for parallel RNN computation and propose the Carry-lookahead RNN (CL-RNN). CL-RNN offers advantages in parallelism and a flexible receptive field. Through a comprehensive set of tests, we verify that CL-RNN can perform better than existing typical RNNs in sequence modeling tasks that are specially designed for RNNs.
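
To make the adder analogy concrete, here is a minimal sketch of the two classic adder designs the abstract refers to. It illustrates only the digital-logic idea (generate and propagate signals), not the CL-RNN module itself. All function names are hypothetical, and bits are listed least significant first.

```python
def to_bits(x, n):
    """Return the n low-order bits of x, least significant bit first."""
    return [(x >> i) & 1 for i in range(n)]


def ripple_carry_add(a_bits, b_bits):
    """Serial adder: each carry must wait for the previous carry."""
    carry, out = 0, []
    for a, b in zip(a_bits, b_bits):
        out.append(a ^ b ^ carry)
        carry = (a & b) | ((a ^ b) & carry)
    return out, carry


def carry_lookahead_add(a_bits, b_bits):
    """Carry-lookahead adder: every carry is a function of the inputs only."""
    g = [a & b for a, b in zip(a_bits, b_bits)]  # generate signals
    p = [a ^ b for a, b in zip(a_bits, b_bits)]  # propagate signals
    n = len(g)
    carries = [0] * (n + 1)  # carries[0] is the carry-in, assumed to be 0
    for i in range(1, n + 1):
        # c_i = g_{i-1} | p_{i-1} g_{i-2} | p_{i-1} p_{i-2} g_{i-3} | ...
        # This uses only g and p, never another carry, so in hardware all
        # carries can be evaluated at the same time.
        c, prod = 0, 1
        for j in range(i - 1, -1, -1):
            c |= prod & g[j]
            prod &= p[j]
        carries[i] = c
    out = [p[i] ^ carries[i] for i in range(n)]
    return out, carries[n]


print(ripple_carry_add(to_bits(11, 4), to_bits(6, 4)))
print(carry_lookahead_add(to_bits(11, 4), to_bits(6, 4)))
```

Both functions return the same sum bits and carry-out. The difference is the dependency structure: the ripple version has a chain of length n, while the lookahead version removes the chain at the cost of wider logic per carry.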

Zero-shot Generalization in Dialog State Tracking through Generative Question Answering

Jan 20, 2021
Shuyang Li, Jin Cao, Mukund Sridhar, Henghui Zhu, Shang-Wen Li, Wael Hamza, Julian McAuley

Dialog State Tracking (DST), an integral part of modern dialog systems, aims to track user preferences and constraints (slots) in task-oriented dialogs. In real-world settings with constantly changing services, DST systems must generalize to new domains and unseen slot types. Existing methods for DST do not generalize well to new slot names and many require known ontologies of slot types and values for inference. We introduce a novel ontology-free framework that supports natural language queries for unseen constraints and slots in multi-domain task-oriented dialogs. Our approach is based on generative question-answering using a conditional language model pre-trained on substantive English sentences. Our model improves joint goal accuracy in zero-shot domain adaptation settings by up to 9% (absolute) over the previous state-of-the-art on the MultiWOZ 2.1 dataset.
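
As a rough illustration of casting slot tracking as question answering, the sketch below turns a (domain, slot) pair into a natural language question and pairs it with the dialog context as one text input for a generative model. The question template, the `question:`/`context:` input layout, and all names are assumptions for illustration; the paper's actual prompts and model interface are not given in this abstract.

```python
def slot_to_question(domain: str, slot: str) -> str:
    """Build a natural language question for a slot (hypothetical template)."""
    readable_slot = slot.replace("_", " ")
    return f"What is the {readable_slot} of the {domain} the user wants?"


def build_qa_input(dialog_turns, domain: str, slot: str) -> str:
    """Join the question and the dialog history into a single text input."""
    context = " ".join(f"{speaker}: {utterance}" for speaker, utterance in dialog_turns)
    return f"question: {slot_to_question(domain, slot)} context: {context}"


# A made-up dialog, used only to show the input format.
turns = [
    ("user", "I need a cheap hotel in the north."),
    ("system", "Sure, for how many nights?"),
]

# Because the slot is described in plain language, a slot name that was
# never seen in training can still be queried without a fixed ontology.
print(build_qa_input(turns, "hotel", "price_range"))
```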

* Accepted as a Long Paper at EACL 2021

Towards Semi-Supervised Semantics Understanding from Speech

Nov 11, 2020
Cheng-I Lai, Jin Cao, Sravan Bodapati, Shang-Wen Li

Much recent work on Spoken Language Understanding (SLU) falls short in at least one of three ways: models were trained on oracle text input and neglected the Automatic Speech Recognition (ASR) outputs, models were trained to predict only intents without the slot values, or models were trained on a large amount of in-house data. We proposed a clean and general framework to learn semantics directly from speech with semi-supervision from transcribed speech to address these shortcomings. Our framework is built upon pretrained end-to-end (E2E) ASR and self-supervised language models, such as BERT, and fine-tuned on a limited amount of target SLU corpus. In parallel, we identified two inadequate settings under which SLU models have been tested: noise-robustness and E2E semantics evaluation. We tested the proposed framework under realistic environmental noises and with a new metric, the slots edit F1 score, on two public SLU corpora. Experiments show that our SLU framework with speech as input can perform on par with those with oracle text as input in semantics understanding, even when environmental noise is present and only a limited amount of labeled semantics data is available.

* arXiv admin note: text overlap with arXiv:2010.13826

Semantic Labeling Using a Deep Contextualized Language Model

Oct 30, 2020
Mohamed Trabelsi, Jin Cao, Jeff Heflin

Generating schema labels automatically for column values of data tables has many data science applications, such as schema matching and data discovery and linking. For example, automatically extracted tables with missing headers can be filled in with the predicted schema labels, which significantly reduces human effort. Furthermore, the predicted labels can reduce the impact of inconsistent names across multiple data tables. Understanding the connection between column values and contextual information is an important yet neglected aspect, as previously proposed methods treat each column independently. In this paper, we propose a context-aware semantic labeling method using both the column values and context. Our new method is based on a new setting for semantic labeling, where we sequentially predict labels for an input table with missing headers. We incorporate both the values and context of each data column using the pre-trained contextualized language model BERT, which has achieved significant improvements in multiple natural language processing tasks. To our knowledge, we are the first to successfully apply BERT to solve the semantic labeling task. We evaluate our approach using two real-world datasets from different domains, and we demonstrate substantial improvements in terms of evaluation metrics over state-of-the-art feature-based methods.

Style Attuned Pre-training and Parameter Efficient Fine-tuning for Spoken Language Understanding

Oct 09, 2020
Jin Cao, Jun Wang, Wael Hamza, Kelly Vanee, Shang-Wen Li

Neural models have yielded state-of-the-art results in deciphering spoken language understanding (SLU) problems; however, these models require a significant amount of domain-specific labeled examples for training, which is prohibitively expensive. While pre-trained language models like BERT have been shown to capture a massive amount of knowledge by learning from unlabeled corpora and solve SLU using fewer labeled examples for adaptation, the encoding of knowledge is implicit and agnostic to downstream tasks. Such encoding results in model inefficiencies in parameter usage: an entirely new model is required for every domain. To address these challenges, we introduce a novel SLU framework, comprising a conversational language modeling (CLM) pre-training task and a light encoder architecture. The CLM pre-training enables networks to capture the representation of the language in conversation style with the presence of ASR errors. The light encoder architecture separates the shared pre-trained networks from the mappings of generally encoded knowledge to specific domains of SLU, allowing for the domain adaptation to be performed solely at the light encoder and thus increasing efficiency. With the framework, we match the performance of state-of-the-art SLU results on Alexa internal datasets and on two public ones (ATIS, SNIPS), adding only 4.4% parameters per task.

* Accepted at INTERSPEECH 2020

A Lightweight Algorithm to Uncover Deep Relationships in Data Tables

Sep 07, 2020
Jin Cao, Yibo Zhao, Linjun Zhang, Jason Li

Much of the data we collect today is in tabular form, with rows as records and columns as attributes associated with each record. Understanding the structural relationships in tabular data can greatly facilitate the data science process. Traditionally, much of this relational information is stored in the table schema and maintained by its creators, usually domain experts. In this paper, we develop automated methods to uncover deep relationships in a single data table without expert or domain knowledge. Our method can decompose a data table into layers of smaller tables, revealing its deep structure. The key to our approach is a computationally lightweight forward addition algorithm that recursively extracts the functional dependencies between table columns and scales to tables with many columns. With our solution, data scientists will be provided with automatically generated, data-driven insights when exploring new data sets.
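
For readers unfamiliar with functional dependencies, the sketch below checks the basic definition on a tiny table: column A determines column B if every value of A is always paired with the same value of B. This is only the definition, not the paper's forward addition algorithm, and the function name and sample table are made up for illustration.

```python
def determines(rows, lhs, rhs):
    """Return True if column `lhs` functionally determines column `rhs`."""
    seen = {}
    for row in rows:
        key, value = row[lhs], row[rhs]
        # setdefault stores the first value seen for this key and returns
        # the stored value on every later occurrence of the same key.
        if seen.setdefault(key, value) != value:
            return False
    return True


# A made-up table: each city belongs to exactly one country,
# but a country can contain several cities.
table = [
    {"city": "Lyon", "country": "France", "visits": 3},
    {"city": "Paris", "country": "France", "visits": 7},
    {"city": "Lyon", "country": "France", "visits": 5},
    {"city": "Osaka", "country": "Japan", "visits": 2},
]

print(determines(table, "city", "country"))   # city -> country holds here
print(determines(table, "country", "city"))   # country -> city does not
```

A dependency such as city -> country suggests that the pair could be split into its own smaller lookup table, which is the kind of layered decomposition the abstract describes.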

* 9 pages, 4 figures, paper presented at AutoML 2019 (The Third International Workshop on Automation in Machine Learning)

A Fast Randomized Algorithm for Finding the Maximal Common Subsequences

Sep 07, 2020
Jin Cao, Dewei Zhong

Finding common subsequences of $L$ strings has many applications in bioinformatics, computational linguistics, and information retrieval. A well-known result states that finding a Longest Common Subsequence (LCS) for $L$ strings is NP-hard, e.g., the computational complexity is exponential in $L$. In this paper, we develop a randomized algorithm, referred to as Random-MCS, for finding a random instance of a Maximal Common Subsequence (MCS) of multiple strings. A common subsequence is maximal if inserting any character into the subsequence no longer yields a common subsequence. A special case of MCS is LCS, where the length is the longest. We show that the complexity of our algorithm is linear in $L$, and it is therefore suitable for large $L$. Furthermore, we study the occurrence probability for a single instance of MCS and demonstrate via both theoretical and experimental studies that the longest subsequence from multiple runs of Random-MCS often yields a solution to LCS.
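
The definition of maximality can be checked directly by brute force. The sketch below tests whether a candidate is a maximal common subsequence by trying every single-character insertion. It is a checker for the definition only, not the Random-MCS algorithm, and all names are hypothetical.

```python
def is_subsequence(s, t):
    """Return True if s can be obtained from t by deleting characters."""
    remaining = iter(t)
    # `ch in remaining` advances the iterator, so matches stay in order.
    return all(ch in remaining for ch in s)


def is_common_subsequence(s, strings):
    """Return True if s is a subsequence of every string."""
    return all(is_subsequence(s, t) for t in strings)


def is_maximal(s, strings):
    """Return True if s is common and no one-character insertion stays common."""
    if not is_common_subsequence(s, strings):
        return False
    alphabet = set().union(*strings)
    for pos in range(len(s) + 1):
        for ch in alphabet:
            if is_common_subsequence(s[:pos] + ch + s[pos:], strings):
                return False
    return True


strings = ["abc", "bca"]
print(is_maximal("bc", strings))  # "bc" is the longest common subsequence
print(is_maximal("a", strings))   # "a" cannot be extended, yet it is shorter
print(is_maximal("b", strings))   # "b" can be extended to "bc"
```

In this example both "bc" and "a" are maximal, but only "bc" is longest. That gap is why a single maximal common subsequence need not be an LCS, and why the abstract takes the longest result over multiple runs.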

* 9 pages
