In this work, we explore a useful but often neglected methodology for robustness analysis of text generation evaluation metrics: stress tests with synthetic data. The idea is simple: we design and synthesize a wide range of potential errors and check whether they result in a commensurate drop in the metric scores. We examine a range of recently proposed evaluation metrics based on pretrained language models, for the tasks of open-ended generation, translation, and summarization. Our experiments reveal interesting insensitivities, biases, or even loopholes in existing metrics. For example, we find that BERTScore ignores truncation errors in summarization, and MAUVE (built on top of GPT-2) is insensitive to errors at the beginning of generations. Further, we investigate the reasons behind these blind spots and suggest practical workarounds for a more reliable evaluation of text generation.
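To make the protocol concrete, below is a minimal sketch of one such stress test, using a synthetic truncation error and BERTScore (via the `bert-score` package) as the metric under test. The example texts and the 50% truncation ratio are illustrative stand-ins, not the paper's actual test suite.

```python
# A minimal stress test: corrupt the candidates with a synthetic truncation
# error and check whether the metric score drops commensurately.
from bert_score import score  # pip install bert-score

references = ["The quick brown fox jumps over the lazy dog near the river bank."]
candidates = ["A quick brown fox leaps over a lazy dog by the river bank."]

def truncate(text: str, keep_ratio: float = 0.5) -> str:
    """Synthetic error: drop the trailing part of the generation."""
    words = text.split()
    return " ".join(words[: max(1, int(len(words) * keep_ratio))])

corrupted = [truncate(c) for c in candidates]

# Score the clean and the corrupted candidates against the same references.
_, _, f1_clean = score(candidates, references, lang="en")
_, _, f1_corrupt = score(corrupted, references, lang="en")

drop = (f1_clean - f1_corrupt).mean().item()
print(f"clean F1={f1_clean.mean().item():.3f}  "
      f"corrupted F1={f1_corrupt.mean().item():.3f}  drop={drop:.3f}")
# A near-zero (or negative) drop flags a blind spot of the metric.
```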
Pseudo-label (PL) filtering is a crucial component of self-training (ST) methods for unsupervised domain adaptation. Dropout-based Uncertainty-driven Self-Training (DUST) first trains a teacher model on labeled source-domain data. The teacher model then provides PLs for the unlabeled target-domain data, and a student model is trained on the labeled data augmented with the pseudo-labeled data. The process is iterative: the student becomes the teacher for the next DUST iteration. A crucial step preceding student training in each iteration is filtering out noisy PLs that could lead the student model astray. In DUST, we proposed a simple, effective, and theoretically sound PL filtering strategy based on the teacher model's uncertainty about its predictions on unlabeled speech utterances: we estimate this uncertainty by computing the disagreement among multiple hypotheses sampled from the teacher during inference, with noise injected via dropout. In this work, we show that DUST's PL filtering, as originally formulated, may fail under severe source-target domain mismatch, and we suggest several approaches to eliminate or alleviate this issue. Further, we bring insights from research on neural network calibration to DUST and show that a well-calibrated model correlates strongly with a positive outcome of the DUST PL filtering step.
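As an illustration of the filtering step, the PyTorch sketch below implements one plausible form of dropout-based disagreement filtering. The `decode` callable, the normalized token-level edit-distance disagreement measure, and the threshold `tau` are assumptions for illustration, not DUST's exact recipe.

```python
# Sketch of dropout-based PL filtering under illustrative assumptions.
import torch

def edit_distance(a: list, b: list) -> int:
    """Plain Levenshtein distance between two token sequences."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return dp[len(b)]

@torch.no_grad()
def keep_pseudo_label(model, utterance, decode, num_samples=5, tau=0.1):
    """Keep a PL only if noisy dropout samples agree with the clean decode."""
    model.eval()
    reference = decode(model, utterance)   # deterministic hypothesis
    model.train()                          # re-enables dropout at inference
                                           # (note: train() also affects
                                           # BatchNorm layers, if present)
    disagreements = []
    for _ in range(num_samples):
        sample = decode(model, utterance)  # hypothesis under dropout noise
        d = edit_distance(reference.split(), sample.split())
        disagreements.append(d / max(1, len(reference.split())))
    model.eval()
    return max(disagreements) <= tau
```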
Existing work on controlled text generation (CTG) assumes a control interface of categorical attributes. In this work, we propose a natural language (NL) interface: we craft a PCFG to embed the control attributes into natural language commands, and we propose variants of existing CTG models that take commands as input. In our experiments, we design tailored setups to test the models' generalization abilities. We find that our PCFG-based command generation handles unseen commands more effectively than fixed-set templates, and that our proposed NL models generalize to unseen attributes (a new capability enabled by the NL interface) as well as to unseen attribute combinations. Interestingly, we discover that the simple conditional generation approach, enhanced with our proposed NL interface, is a strong baseline in these challenging settings.
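To illustrate the command-generation idea, here is a toy PCFG sampler in plain Python. The grammar, attributes, and phrasings below are invented stand-ins for the paper's hand-crafted PCFG.

```python
# Toy PCFG mapping control attributes (TONE, TOPIC) to NL commands.
import random

PCFG = {
    "CMD":   [(("write a", "TONE", "review about a", "TOPIC"), 0.6),
              (("generate a", "TOPIC", "text in a", "TONE", "voice"), 0.4)],
    "TONE":  [(("positive",), 0.5), (("negative",), 0.5)],
    "TOPIC": [(("restaurant",), 0.5), (("movie",), 0.5)],
}

def sample(symbol: str) -> str:
    """Recursively expand a nonterminal according to the rule probabilities."""
    if symbol not in PCFG:
        return symbol                      # terminal token
    rules, probs = zip(*PCFG[symbol])
    rhs = random.choices(rules, weights=probs, k=1)[0]
    return " ".join(sample(s) for s in rhs)

for _ in range(3):
    print(sample("CMD"))
# e.g. "write a positive review about a movie"
```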
Multilingual text-video retrieval methods have improved significantly in recent years, but performance in languages other than English still lags behind. We propose a Cross-Lingual Cross-Modal Knowledge Distillation method to improve multilingual text-video retrieval. Inspired by the fact that English text-video retrieval outperforms other languages, we train a student model using input text in different languages to match the cross-modal predictions of teacher models that use English input text. We propose a cross-entropy-based objective that forces the distribution over the student's text-video similarity scores to be similar to that of the teacher models. We introduce a new multilingual video dataset, Multi-YouCook2, by translating the English captions in the YouCook2 video dataset into 8 other languages. Our method improves multilingual text-video retrieval performance on Multi-YouCook2 and several other datasets, such as Multi-MSRVTT and VATEX. We also conduct an analysis of the effectiveness of different multilingual text models as teachers.
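The PyTorch sketch below shows one plausible form of such a distribution-matching objective. The embedding shapes, the temperature, and the use of in-batch videos as the candidate set are assumptions for illustration; in practice the teacher would be frozen and the video embeddings shared.

```python
# Sketch of a distillation objective: the student's softmax over text-video
# similarities is pushed toward the teacher's distribution.
import torch
import torch.nn.functional as F

def distillation_loss(student_text, teacher_text, video, tau=0.07):
    """student_text: (B, D) embeddings of non-English text from the student.
    teacher_text: (B, D) embeddings of the English text from the teacher.
    video: (B, D) in-batch video embeddings used as retrieval candidates."""
    s_sim = student_text @ video.t() / tau   # (B, B) similarity logits
    t_sim = teacher_text @ video.t() / tau
    # Cross entropy between teacher and student distributions over videos.
    t_prob = F.softmax(t_sim, dim=-1)
    return -(t_prob * F.log_softmax(s_sim, dim=-1)).sum(dim=-1).mean()

# Toy usage with random embeddings:
B, D = 8, 256
loss = distillation_loss(torch.randn(B, D), torch.randn(B, D), torch.randn(B, D))
print(loss.item())
```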
Conventional audio-visual models have independent audio and video branches. In this paper, we present the Unified Audio-Visual Model (UAVM), a single model for both audio and video processing; we report its new state-of-the-art audio-visual event classification accuracy of 65.8% on VGGSound and describe the model's intriguing properties.
Through a series of federal initiatives and orders, the U.S. Government has been making a concerted effort to ensure American leadership in AI. These broad strategy documents have influenced organizations such as the United States Department of the Air Force (DAF). The DAF-MIT AI Accelerator is an initiative between the DAF and MIT to bridge the gap between AI researchers and DAF mission requirements. Several projects supported by the DAF-MIT AI Accelerator are developing public challenge problems that address numerous Federal AI research priorities. These challenges target priorities by making large, AI-ready datasets publicly available, incentivizing open-source solutions, and creating a demand signal for dual-use technologies that can stimulate further research. In this article, we describe the public challenge problems being developed and how their application contributes to scientific advances.
We propose SAMU-XLSR: a Semantically-Aligned Multimodal Utterance-level Cross-Lingual Speech Representation learning framework. Unlike previous work on speech representation learning, which learns multilingual contextual speech embeddings at the resolution of an acoustic frame (10-20 ms), this work focuses on learning multimodal (speech-text) multilingual speech embeddings at the resolution of a sentence (5-10 s), such that the embedding vector space is semantically aligned across different languages. We combine the state-of-the-art multilingual acoustic frame-level speech representation model XLS-R with the Language Agnostic BERT Sentence Embedding (LaBSE) model to create the utterance-level multimodal multilingual speech encoder SAMU-XLSR. Although we train SAMU-XLSR with only multilingual transcribed speech data, cross-lingual speech-text and speech-speech associations emerge in its learned representation space. To substantiate these claims, we use the SAMU-XLSR speech encoder in combination with the pre-trained LaBSE text sentence encoder for cross-lingual speech-to-text translation retrieval, and SAMU-XLSR alone for cross-lingual speech-to-speech translation retrieval. We demonstrate these applications on a variety of cross-lingual text and speech translation retrieval tasks across several datasets.
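The following PyTorch sketch illustrates the utterance-level alignment idea: frame-level features (stand-ins for XLS-R outputs) are pooled to a single vector and trained to match a frozen LaBSE sentence embedding under a cosine loss. The mean pooling, the projection layer, and the dimensions are illustrative simplifications, not the paper's exact architecture.

```python
# Sketch: pool frame-level speech features to one utterance vector and
# align it to a frozen text sentence embedding with a cosine loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class UtterancePooler(nn.Module):
    def __init__(self, frame_dim=1024, sent_dim=768):
        super().__init__()
        self.proj = nn.Linear(frame_dim, sent_dim)

    def forward(self, frames):              # frames: (B, T, frame_dim)
        pooled = frames.mean(dim=1)         # simple mean pooling over time
        return F.normalize(self.proj(pooled), dim=-1)

pooler = UtterancePooler()
xlsr_frames = torch.randn(4, 500, 1024)     # stand-in for XLS-R outputs
labse_target = F.normalize(torch.randn(4, 768), dim=-1)  # frozen text embs

speech_emb = pooler(xlsr_frames)
loss = (1 - F.cosine_similarity(speech_emb, labse_target, dim=-1)).mean()
print(loss.item())
```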
Recognizing human non-speech vocalizations is an important task with broad applications such as automatic sound transcription and health condition monitoring. However, existing datasets have relatively few vocal sound samples or noisy labels. As a consequence, state-of-the-art audio event classification models may not perform well at detecting human vocal sounds. To support research on robust and accurate vocal sound recognition, we created the VocalSound dataset, consisting of over 21,000 crowdsourced recordings of laughter, sighs, coughs, throat clearing, sneezes, and sniffs from 3,365 unique subjects. Experiments show that a model's vocal sound recognition performance can be significantly improved, by 41.9%, by adding the VocalSound dataset to an existing dataset as training material. In addition, unlike previous datasets, the VocalSound dataset contains meta-information such as speaker age, gender, native language, country, and health condition.
Automatic pronunciation assessment is an important technology for helping self-directed language learners. While pronunciation quality has multiple aspects, including accuracy, fluency, completeness, and prosody, previous efforts typically model only one aspect (e.g., accuracy) at a single granularity (e.g., the phoneme level). In this work, we explore modeling multi-aspect pronunciation assessment at multiple granularities. Specifically, we train a Goodness Of Pronunciation feature-based Transformer (GOPT) with multi-task learning. Experiments show that GOPT achieves the best results on speechocean762 with a public automatic speech recognition (ASR) acoustic model trained on LibriSpeech.
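As a rough illustration of the multi-aspect, multi-granularity setup, the PyTorch sketch below runs a small Transformer encoder over phone-level GOP feature vectors and attaches a per-phone head plus utterance-level aspect heads. The dimensions and head design are illustrative assumptions, not GOPT's exact configuration.

```python
# Sketch of a GOP-feature Transformer with multi-granularity task heads.
import torch
import torch.nn as nn

class GOPTSketch(nn.Module):
    def __init__(self, gop_dim=84, d_model=24, nhead=3, num_layers=3):
        super().__init__()
        self.inp = nn.Linear(gop_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.phone_head = nn.Linear(d_model, 1)     # per-phone accuracy
        self.utt_heads = nn.ModuleDict({            # utterance-level aspects
            a: nn.Linear(d_model, 1)
            for a in ("accuracy", "fluency", "completeness", "prosody")
        })

    def forward(self, gop):                 # gop: (B, T_phones, gop_dim)
        h = self.encoder(self.inp(gop))
        phone_scores = self.phone_head(h).squeeze(-1)   # (B, T_phones)
        utt = h.mean(dim=1)                             # pooled utterance state
        utt_scores = {a: head(utt).squeeze(-1)
                      for a, head in self.utt_heads.items()}
        return phone_scores, utt_scores

model = GOPTSketch()
phone, utt = model(torch.randn(2, 50, 84))
# Multi-task training would sum, e.g., MSE losses over all heads.
```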
We propose DiffCSE, an unsupervised contrastive learning framework for learning sentence embeddings. DiffCSE learns sentence embeddings that are sensitive to the difference between the original sentence and an edited sentence, where the edited sentence is obtained by stochastically masking out tokens of the original sentence and then sampling replacements from a masked language model. We show that DiffCSE is an instance of equivariant contrastive learning (Dangovski et al., 2021), which generalizes contrastive learning and learns representations that are insensitive to certain types of augmentations and sensitive to other, "harmful" types of augmentations. Our experiments show that DiffCSE achieves state-of-the-art results among unsupervised sentence representation learning methods, outperforming unsupervised SimCSE by 2.3 absolute points on semantic textual similarity tasks.
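To make the editing operation concrete, here is a sketch using Hugging Face Transformers: a random subset of tokens is masked and replacements are sampled from an off-the-shelf MLM. The model choice and mask ratio are illustrative; DiffCSE's full training additionally involves a generator/discriminator setup beyond this edit step.

```python
# Sketch of the MLM-based sentence editing used to create "edited" sentences.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("distilbert-base-uncased").eval()

def edit(sentence: str, mask_ratio: float = 0.3) -> str:
    ids = tok(sentence, return_tensors="pt")["input_ids"][0]
    # Mask a random subset of non-special tokens (skip [CLS]/[SEP]).
    candidates = torch.arange(1, len(ids) - 1)
    n_mask = max(1, int(mask_ratio * len(candidates)))
    picked = candidates[torch.randperm(len(candidates))[:n_mask]]
    masked = ids.clone()
    masked[picked] = tok.mask_token_id
    with torch.no_grad():
        logits = mlm(masked.unsqueeze(0)).logits[0]
    # Sample replacements from the MLM's distribution at each masked slot.
    sampled = torch.multinomial(logits[picked].softmax(-1), 1).squeeze(-1)
    edited = ids.clone()
    edited[picked] = sampled
    return tok.decode(edited[1:-1])

print(edit("The movie was surprisingly good and well acted."))
```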