David Mimno

Princeton University

Honest Students from Untrusted Teachers: Learning an Interpretable Question-Answering Pipeline from a Pretrained Language Model

Oct 05, 2022

Jacob Eisenstein, Daniel Andor, Bernd Bohnet, Michael Collins, David Mimno

Abstract: Explainable question answering systems should produce not only accurate answers but also rationales that justify their reasoning and allow humans to check their work. But what sorts of rationales are useful and how can we train systems to produce them? We propose a new style of rationale for open-book question answering, called markup-and-mask, which combines aspects of extractive and free-text explanations. In the markup phase, the passage is augmented with free-text markup that enables each sentence to stand on its own outside the discourse context. In the masking phase, a sub-span of the marked-up passage is selected. To train a system to produce markup-and-mask rationales without annotations, we leverage in-context learning. Specifically, we generate silver annotated data by sending a series of prompts to a frozen pretrained language model, which acts as a teacher. We then fine-tune a smaller student model by training on the subset of rationales that led to correct answers. The student is "honest" in the sense that it is a pipeline: the rationale acts as a bottleneck between the passage and the answer, while the "untrusted" teacher operates under no such constraints. Thus, we offer a new way to build trustworthy pipeline systems from a combination of end-task annotations and frozen pretrained language models.
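
The filtering step in this abstract (keeping only teacher rationales that led to correct answers) might look roughly like the sketch below. The `SilverExample` record, the `normalize` helper, and the exact-match criterion are all assumptions made for illustration; the abstract does not say how correctness is judged.

```python
from dataclasses import dataclass


@dataclass
class SilverExample:
    """Hypothetical record for one teacher-generated training example."""
    question: str
    rationale: str       # teacher-produced markup-and-mask rationale
    teacher_answer: str  # answer the teacher gave from that rationale
    gold_answer: str     # end-task annotation


def normalize(text: str) -> str:
    # Assumed matching rule: case-insensitive, whitespace-insensitive.
    return " ".join(text.lower().split())


def keep_correct(examples):
    """Keep only examples whose teacher answer matches the gold answer."""
    return [
        ex for ex in examples
        if normalize(ex.teacher_answer) == normalize(ex.gold_answer)
    ]


examples = [
    SilverExample("Who wrote it?", "Boccaccio [the author] wrote it.",
                  "Boccaccio", "boccaccio"),
    SilverExample("When?", "It was written in 1400.", "1400", "1353"),
]
student_training_set = keep_correct(examples)
```

Only the first toy example survives the filter, so the student would be fine-tuned on that rationale alone.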

On-the-Fly Rectification for Robust Large-Vocabulary Topic Inference

Nov 12, 2021

Moontae Lee, Sungjun Cho, Kun Dong, David Mimno, David Bindel

Abstract: Across many data domains, co-occurrence statistics about the joint appearance of objects are powerfully informative. By transforming unsupervised learning problems into decompositions of co-occurrence statistics, spectral algorithms provide transparent and efficient algorithms for posterior inference such as latent topic analysis and community detection. As object vocabularies grow, however, it becomes rapidly more expensive to store and run inference algorithms on co-occurrence statistics. Rectifying co-occurrence, the key process to uphold model assumptions, becomes increasingly vital in the presence of rare terms, but current techniques cannot scale to large vocabularies. We propose novel methods that simultaneously compress and rectify co-occurrence statistics, scaling gracefully with the size of vocabulary and the dimension of latent space. We also present new algorithms for learning latent variables from the compressed statistics, and verify that our methods perform comparably to previous approaches on both textual and non-textual data.

Tecnologica cosa: Modeling Storyteller Personalities in Boccaccio's Decameron

Sep 22, 2021

A. Feder Cooper, Maria Antoniak, Christopher De Sa, Marilyn Migiel, David Mimno

Abstract: We explore Boccaccio's Decameron to see how digital humanities tools can be used for tasks that have limited data in a language no longer in contemporary use: medieval Italian. We focus our analysis on the question: Do the different storytellers in the text exhibit distinct personalities? To answer this question, we curate and release a dataset based on the authoritative edition of the text. We use supervised classification methods to predict storytellers based on the stories they tell, confirming the difficulty of the task, and demonstrate that topic modeling can extract thematic storyteller "profiles."

* The 5th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (co-located with EMNLP 2021)

Comparing Text Representations: A Theory-Driven Approach

Sep 15, 2021

Gregory Yauney, David Mimno

Abstract: Much of the progress in contemporary NLP has come from learning representations, such as masked language model (MLM) contextual embeddings, that turn challenging problems into simple classification tasks. But how do we quantify and explain this effect? We adapt general tools from computational learning theory to fit the specific characteristics of text datasets and present a method to evaluate the compatibility between representations and tasks. Even though many tasks can be easily solved with simple bag-of-words (BOW) representations, BOW does poorly on hard natural language inference tasks. For one such task we find that BOW cannot distinguish between real and randomized labelings, while pre-trained MLM representations show 72x greater distinction between real and random labelings than BOW. This method provides a calibrated, quantitative measure of the difficulty of a classification-based NLP task, enabling comparisons between representations without requiring empirical evaluations that may be sensitive to initializations and hyperparameters. The method provides a fresh perspective on the patterns in a dataset and the alignment of those patterns with specific labels.
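
The idea of contrasting real and randomized labelings can be illustrated with a much cruder stand-in than the learning-theoretic measure the abstract describes. The sketch below is not the paper's method: it simply scores a representation by leave-one-out nearest-neighbor accuracy, once with the real labels and once with shuffled ones. The function name, the toy vectors, and the seed are all assumptions.

```python
import random


def loo_1nn_accuracy(vectors, labels):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier."""
    correct = 0
    for i, v in enumerate(vectors):
        # Nearest other point by squared Euclidean distance.
        nearest = min(
            (j for j in range(len(vectors)) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(v, vectors[j])),
        )
        correct += labels[nearest] == labels[i]
    return correct / len(vectors)


# Toy "representation": two well-separated groups of points.
vectors = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
real_labels = [0, 0, 0, 1, 1, 1]

shuffled_labels = real_labels[:]
random.Random(0).shuffle(shuffled_labels)  # randomized labeling

real_score = loo_1nn_accuracy(vectors, real_labels)
random_score = loo_1nn_accuracy(vectors, shuffled_labels)
```

A representation that fits the task should score clearly higher on the real labels than on shuffled ones; a representation that scores about the same on both is not capturing the label structure.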

* Published in EMNLP 2021

Domain-Specific Lexical Grounding in Noisy Visual-Textual Documents

Oct 30, 2020

Gregory Yauney, Jack Hessel, David Mimno

Abstract: Images can give us insights into the contextual meanings of words, but current image-text grounding approaches require detailed annotations. Such granular annotation is rare, expensive, and unavailable in most domain-specific contexts. In contrast, unlabeled multi-image, multi-sentence documents are abundant. Can lexical grounding be learned from such documents, even though they have significant lexical and visual overlap? Working with a case study dataset of real estate listings, we demonstrate the challenge of distinguishing highly correlated grounded terms, such as "kitchen" and "bedroom", and introduce metrics to assess this document similarity. We present a simple unsupervised clustering-based method that increases precision and recall beyond object detection and image tagging baselines when evaluated on labeled subsets of the dataset. The proposed method is particularly effective for local contextual meanings of a word, for example associating "granite" with countertops in the real estate dataset and with rocky landscapes in a Wikipedia dataset.

* Published in EMNLP 2020

Topic Modeling with Contextualized Word Representation Clusters

Oct 23, 2020

Laure Thompson, David Mimno

Abstract: Clustering token-level contextualized word representations produces output that shares many similarities with topic models for English text collections. Unlike clusterings of vocabulary-level word embeddings, the resulting models more naturally capture polysemy and can be used as a way of organizing documents. We evaluate token clusterings trained from several different output layers of popular contextualized language models. We find that BERT and GPT-2 produce high quality clusterings, but RoBERTa does not. These cluster models are simple, reliable, and can perform as well as, if not better than, LDA topic models, maintaining high topic quality even when the number of topics is large relative to the size of the local collection.
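
As a rough illustration of token-level clustering, the sketch below runs a plain k-means over per-token vectors. Everything here is an assumption: the 2-D vectors are invented stand-ins for contextualized embeddings (which would come from a model such as BERT), and the abstract does not specify the clustering algorithm or its initialization.

```python
import math


def kmeans(points, k, iters=20):
    """Plain k-means; returns one cluster label per point."""
    # Farthest-point initialization keeps this toy example deterministic.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


# One made-up vector per token occurrence; "bank" appears in two contexts.
tokens = ["bank", "river", "bank", "loan"]
vectors = [(0.0, 0.1), (0.1, 0.0), (5.0, 5.1), (5.1, 5.0)]

labels = kmeans(vectors, k=2)
clusters = {}
for tok, lab in zip(tokens, labels):
    clusters.setdefault(lab, []).append(tok)
```

Because each occurrence is clustered separately, the two "bank" tokens can land in different clusters, which is the polysemy behavior the abstract contrasts with vocabulary-level embeddings.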

How we do things with words: Analyzing text as social and cultural data

Jul 02, 2019

Dong Nguyen, Maria Liakata, Simon DeDeo, Jacob Eisenstein, David Mimno, Rebekah Tromble, Jane Winters

Abstract: In this article we describe our experiences with computational text analysis. We hope to achieve three primary goals. First, we aim to shed light on thorny issues not always at the forefront of discussions about computational text analysis methods. Second, we hope to provide a set of best practices for working with thick social and cultural concepts. Our guidance is based on our own experiences and is therefore inherently imperfect. Still, given our diversity of disciplinary backgrounds and research practices, we hope to capture a range of ideas and identify commonalities that will resonate for many. And this leads to our final goal: to help promote interdisciplinary collaborations. Interdisciplinary insights and partnerships are essential for realizing the full potential of any computational text analysis that involves social and cultural concepts, and the more we are able to bridge these divides, the more fruitful we believe our work will be.

Unsupervised Discovery of Multimodal Links in Multi-Image, Multi-Sentence Documents

Apr 16, 2019

Jack Hessel, Lillian Lee, David Mimno

Abstract: Images and text co-occur everywhere on the web, but explicit links between images and sentences (or other intra-document textual units) are often not annotated by users. We present algorithms that successfully discover image-sentence relationships without relying on any explicit multimodal annotation. We explore several variants of our approach on seven datasets of varying difficulty, ranging from images that were captioned post hoc by crowd-workers to naturally-occurring user-generated multimodal documents, wherein correspondences between illustrations and individual textual units may not be one-to-one. We find that a structured training objective based on identifying whether sets of images and sentences co-occur in documents can be sufficient to predict links between specific sentences and specific images within the same document at test time.

* Working paper; comments welcome. Code and data available at www.cs.cornell.edu/~jhessel

Quantifying the visual concreteness of words and topics in multimodal datasets

May 23, 2018

Jack Hessel, David Mimno, Lillian Lee

Abstract: Multimodal machine learning algorithms aim to learn visual-textual correspondences. Previous work suggests that concepts with concrete visual manifestations may be easier to learn than concepts with abstract ones. We give an algorithm for automatically computing the visual concreteness of words and topics within multimodal datasets. We apply the approach in four settings, ranging from image captions to images/text scraped from historical books. In addition to enabling explorations of concepts in multimodal datasets, our concreteness scores predict the capacity of machine learning algorithms to learn textual/visual relationships. We find that 1) concrete concepts are indeed easier to learn; 2) the large number of algorithms we consider have similar failure cases; 3) the precise positive relationship between concreteness and performance varies between datasets. We conclude with recommendations for using concreteness scores to facilitate future multimodal research.

* 2018 North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT)
* NAACL HLT 2018, 14 pages, 6 figures, data available at http://www.cs.cornell.edu/~jhessel/concreteness/concreteness.html

Prior-aware Dual Decomposition: Document-specific Topic Inference for Spectral Topic Models

Nov 19, 2017

Moontae Lee, David Bindel, David Mimno

Abstract: Spectral topic modeling algorithms operate on matrices/tensors of word co-occurrence statistics to learn topic-specific word distributions. This approach removes the dependence on the original documents and produces substantial gains in efficiency and provable topic inference, but at a cost: the model can no longer provide information about the topic composition of individual documents. Thresholded Linear Inverse (TLI) was recently proposed to map the observed words of each document back to its topic composition. However, its linear characteristics limit the inference quality because it does not consider the important prior information over topics. In this paper, we evaluate the Simple Probabilistic Inverse (SPI) method and a novel Prior-aware Dual Decomposition (PADD) method that is capable of learning document-specific topic compositions in parallel. Experiments show that PADD successfully leverages topic correlations as a prior, notably outperforming TLI and learning quality topic compositions comparable to Gibbs sampling on various data.
