Michelle Yuan

MegaWika: Millions of reports and their sources across 50 diverse languages

Jul 13, 2023
Samuel Barham, Orion Weller, Michelle Yuan, Kenton Murray, Mahsa Yarmohammadi, Zhengping Jiang, Siddharth Vashishtha, Alexander Martin, Anqi Liu, Aaron Steven White, Jordan Boyd-Graber, Benjamin Van Durme

To foster the development of new models for collaborative AI-assisted report generation, we introduce MegaWika, consisting of 13 million Wikipedia articles in 50 diverse languages, along with their 71 million referenced source materials. We process this dataset for a myriad of applications, going beyond the initial Wikipedia citation extraction and web scraping of content, including translating non-English articles for cross-lingual applications and providing FrameNet parses for automated semantic analysis. MegaWika is the largest resource for sentence-level report generation and the only report generation dataset that is multilingual. We manually analyze the quality of this resource through a semantically stratified sample. Finally, we provide baseline results and trained models for crucial steps in automated report generation: cross-lingual question answering and citation retrieval.

* Submitted to ACL, 2023 
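As a rough illustration, the snippet below sketches how one might stream and inspect MegaWika with the Hugging Face `datasets` library. The repository id `hltcoe/megawika` and the per-record fields are assumptions rather than details confirmed by the abstract; consult the official dataset card for the actual schema and per-language configurations.

```python
# A minimal sketch of streaming MegaWika via Hugging Face `datasets`.
# The repo id and record layout below are assumptions; check the
# dataset card for the real schema and language configuration names.
from itertools import islice
from datasets import load_dataset

# Stream the data to avoid downloading all 13M articles up front.
ds = load_dataset("hltcoe/megawika", split="train", streaming=True)

for record in islice(ds, 3):
    # Each entry pairs a Wikipedia article with its referenced sources.
    print(sorted(record.keys()))
```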

InteractiveIE: Towards Assessing the Strength of Human-AI Collaboration in Improving the Performance of Information Extraction

May 24, 2023
Ishani Mondal, Michelle Yuan, Anandhavelu N, Aparna Garimella, Francis Ferraro, Andrew Blair-Stanek, Benjamin Van Durme, Jordan Boyd-Graber

Learning template-based information extraction from documents is a crucial yet difficult task. Prior template-based IE approaches assume foreknowledge of the domain templates; in real-world settings, however, IE tasks rarely come with pre-defined schemas, and the schema must be figured out as you go. To quickly bootstrap templates in such settings, we need to induce template slots from documents with zero or minimal supervision. Since the goals of question answering and information extraction intersect, we use automatic question generation to induce template slots from the documents and investigate how a small amount of on-the-fly proxy human supervision (termed InteractiveIE) can further boost performance. Extensive experiments on biomedical and legal documents, where obtaining training data is expensive, reveal encouraging performance improvements from InteractiveIE over an AI-only baseline. A schematic sketch of the slot-induction idea appears below.
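The sketch clusters automatically generated questions into candidate template slots that a human could then merge, split, or rename. The hard-coded questions stand in for a question-generation model's output, and TF-IDF plus k-means is an illustrative choice, not necessarily the paper's pipeline.

```python
# Schematic slot induction from generated questions. The questions are
# stand-ins for a question-generation model's output; the clustering
# choice (TF-IDF + KMeans) is illustrative, not the paper's exact method.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

questions = [
    "Who conducted the trial?",
    "Which institution ran the study?",
    "What drug was administered?",
    "Which medication was tested?",
    "How many patients were enrolled?",
    "What was the sample size?",
]

X = TfidfVectorizer().fit_transform(questions)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Each question cluster is a candidate template slot; a human can then
# merge, split, or rename clusters (the "interactive" supervision).
for slot in range(3):
    members = [q for q, l in zip(questions, labels) if l == slot]
    print(f"slot {slot}: {members}")
```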

Adaptive Active Learning for Coreference Resolution

Apr 15, 2021
Michelle Yuan, Patrick Xia, Benjamin Van Durme, Jordan Boyd-Graber

Training coreference resolution models requires comprehensively labeled data, and a model trained on one dataset may not successfully transfer to new domains. This paper investigates an approach to active learning for coreference resolution that feeds discrete annotations to an incremental clustering model. Recent developments in incremental coreference resolution allow for a novel approach to active learning in this setting. Through this new framework, we analyze important factors in data acquisition, such as sources of model uncertainty and the balance between reading and labeling costs. We explore different settings through simulated labeling with gold data. By lowering the data barrier for coreference, our approach lets coreference resolvers rapidly adapt to a series of previously unconsidered domains.
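The toy sketch below illustrates one of the uncertainty sources the paper analyzes: ranking unresolved mentions by the entropy of a model's distribution over candidate antecedents. The random distributions here stand in for an incremental clustering model's actual scores.

```python
# Entropy-based mention selection, a minimal sketch. `antecedent_probs`
# stands in for an incremental clustering model's distribution over
# candidate antecedents (plus "new entity") for each unresolved mention.
import numpy as np

rng = np.random.default_rng(0)
# 10 mentions, each with a distribution over 5 antecedent choices.
antecedent_probs = rng.dirichlet(np.ones(5), size=10)

def entropy(p):
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

scores = entropy(antecedent_probs)
# Query annotations for the k most uncertain mentions; in the paper this
# selection is balanced against the cost of reading new documents.
k = 3
to_label = np.argsort(-scores)[:k]
print("mentions to annotate:", to_label.tolist())
```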

Cold-start Active Learning through Self-supervised Language Modeling

Oct 22, 2020
Michelle Yuan, Hsuan-Tien Lin, Jordan Boyd-Graber

Active learning strives to reduce annotation costs by choosing the most critical examples to label. Typically, the active learning strategy is contingent on the classification model. For instance, uncertainty sampling depends on poorly calibrated model confidence scores. In the cold-start setting, active learning is impractical because of model instability and data scarcity. Fortunately, modern NLP provides an additional source of information: pre-trained language models. The pre-training loss can identify examples that surprise the model and should be labeled for efficient fine-tuning. We therefore treat the language modeling loss as a proxy for classification uncertainty. With BERT, we develop a simple strategy based on the masked language modeling loss that minimizes labeling costs for text classification. Compared to other baselines, our approach reaches higher accuracy in fewer sampling iterations and with less computation time.

* Published in EMNLP 2020 
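The sketch below shows the core proxy computation: scoring unlabeled texts by BERT's masked language modeling loss and labeling the most surprising ones first. The paper's full acquisition strategy is richer than this raw-loss ranking, so treat it as an illustration of the idea only.

```python
# Rank unlabeled texts by BERT's MLM loss on randomly masked tokens and
# label the most "surprising" ones first. A simplified illustration of
# the idea above, not the paper's full acquisition strategy.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

def mlm_loss(text, mask_prob=0.15, seed=0):
    enc = tokenizer(text, return_tensors="pt", truncation=True)
    labels = enc.input_ids.clone()
    torch.manual_seed(seed)
    mask = torch.rand(labels.shape) < mask_prob
    mask &= labels != tokenizer.cls_token_id   # never mask [CLS]
    mask &= labels != tokenizer.sep_token_id   # never mask [SEP]
    if not mask.any():                         # ensure one masked position
        mask[0, 1] = True
    inputs = enc.input_ids.clone()
    inputs[mask] = tokenizer.mask_token_id
    labels[~mask] = -100  # only masked positions contribute to the loss
    with torch.no_grad():
        out = model(input_ids=inputs,
                    attention_mask=enc.attention_mask,
                    labels=labels)
    return out.loss.item()

pool = ["The movie was a complete waste of time.",
        "An understated, quietly devastating performance."]
# Higher loss = more surprising to the pretrained model; label first.
ranked = sorted(pool, key=mlm_loss, reverse=True)
print(ranked)
```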

Interactive Refinement of Cross-Lingual Word Embeddings

Nov 08, 2019
Michelle Yuan, Mozhi Zhang, Benjamin Van Durme, Leah Findlater, Jordan Boyd-Graber

Cross-lingual word embeddings transfer knowledge between languages: models trained for a high-resource language can be used in a low-resource language. These embeddings are usually trained on general-purpose corpora but used for a domain-specific task. We introduce CLIME, an interactive system that allows a user to quickly adapt cross-lingual word embeddings for a given classification problem. First, words in the vocabulary are ranked by their salience to the downstream task. Then, salient keywords are displayed on an interface, and users mark the similarity between each keyword and its nearest neighbors in the embedding space. Finally, CLIME updates the embeddings using these annotations. We evaluate CLIME on a cross-lingual text classification benchmark for four low-resource languages: Ilocano, Sinhalese, Tigrinya, and Uyghur. Embeddings refined by CLIME capture more nuanced word semantics and achieve higher test accuracy than the original embeddings. CLIME also improves test accuracy faster than an active learning baseline, and a simple combination of CLIME with active learning achieves the highest test accuracy.

* First two authors contributed equally 
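The toy sketch below conveys the flavor of the refinement step: pull words a user marks as similar toward the keyword and push dissimilar words away. The specific update rule and learning rate are illustrative assumptions, not the paper's exact objective.

```python
# A toy CLIME-style refinement step. The update rule and learning rate
# are illustrative assumptions, not the paper's exact objective.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["good", "great", "bad", "fine"]
emb = {w: rng.normal(size=8) for w in vocab}

def refine(keyword, positives, negatives, lr=0.1):
    k = emb[keyword]
    for w in positives:   # user: w means roughly the same as the keyword
        emb[w] += lr * (k - emb[w])
    for w in negatives:   # user: w is unrelated to the keyword
        emb[w] -= lr * (k - emb[w])

refine("good", positives=["great", "fine"], negatives=["bad"])
# Inspect how the refinement moved "great" relative to "good".
print(np.dot(emb["good"], emb["great"]))
```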