Giannis Karamanolakis

Benchmarking Generalization via In-Context Instructions on 1,600+ Language Tasks

Apr 16, 2022
Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Anjana Arunkumar, Arjun Ashok, Arut Selvan Dhanasekaran, Atharva Naik, David Stap, Eshaan Pathak, Giannis Karamanolakis, Haizhi Gary Lai, Ishan Purohit, Ishani Mondal, Jacob Anderson, Kirby Kuznia, Krima Doshi, Maitreya Patel, Kuntal Kumar Pal, Mehrad Moradshahi, Mihir Parmar, Mirali Purohit, Neeraj Varshney, Phani Rohitha Kaza, Pulkit Verma, Ravsehaj Singh Puri, Rushang Karia, Shailaja Keyur Sampat, Savan Doshi, Siddhartha Mishra, Sujan Reddy, Sumanta Patro, Tanay Dixit, Xudong Shen, Chitta Baral, Yejin Choi, Hannaneh Hajishirzi, Noah A. Smith, Daniel Khashabi

How can we measure the generalization of models to a variety of unseen tasks when provided with their language instructions? To facilitate progress toward this goal, we introduce Natural-Instructions v2, a collection of 1,600+ diverse language tasks and their expert-written instructions. More importantly, the benchmark covers 70+ distinct task types, such as tagging, in-filling, and rewriting. The benchmark was collected through contributions from NLP practitioners in the community, with an iterative peer-review process to ensure quality. It enables large-scale evaluation of cross-task generalization of models -- training on a subset of tasks and evaluating on the remaining unseen ones. For instance, we are able to rigorously quantify generalization as a function of various scaling parameters, such as the number of observed tasks, the number of instances, and model sizes. As a by-product of these experiments, we introduce Tk-Instruct, an encoder-decoder Transformer trained to follow a variety of in-context instructions (plain-language task definitions or k-shot examples), which outperforms existing larger models on our benchmark. We hope this benchmark facilitates future progress toward more general-purpose language understanding models.
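
The in-context format the benchmark evaluates -- a plain-language task definition plus k-shot examples -- can be sketched as a simple prompt template. The field labels and example task below are illustrative, not the benchmark's exact schema:

```python
# Hypothetical rendering of an instruction-following prompt: a task
# definition, k in-context examples, then the new input to solve.
# Labels ("Definition:", "Input:", "Output:") are illustrative only.

def render_prompt(definition, examples, new_input):
    parts = [f"Definition: {definition}"]
    for x, y in examples:
        parts.append(f"Input: {x}\nOutput: {y}")
    parts.append(f"Input: {new_input}\nOutput:")
    return "\n\n".join(parts)

prompt = render_prompt(
    "Rewrite the sentence in the past tense.",
    [("She walks home.", "She walked home.")],
    "He eats lunch.",
)
```

The same template serves both evaluation settings mentioned above: pass an empty example list for definition-only prompting, or a non-empty one for k-shot prompting.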

* 16 pages, 9 figures 

WALNUT: A Benchmark on Weakly Supervised Learning for Natural Language Understanding

Aug 28, 2021
Guoqing Zheng, Giannis Karamanolakis, Kai Shu, Ahmed Hassan Awadallah

Building quality machine learning models for natural language understanding (NLU) tasks relies heavily on labeled data. Weak supervision has been shown to provide valuable supervision when large amounts of labeled data are unavailable or expensive to obtain. Existing work on weak supervision for NLU either focuses on a specific task or simulates weak supervision signals from ground-truth labels. To date, no benchmark with real-world weak supervision signals for a collection of NLU tasks has been available. In this paper, we propose such a benchmark, named WALNUT, to advocate for and facilitate research on weak supervision for NLU. WALNUT consists of NLU tasks of different types, including both document-level and token-level prediction tasks, and provides, for each task, weak labels generated by multiple real-world weak supervision sources. We conduct baseline evaluations on the benchmark to systematically test the value of weak supervision for NLU tasks, with various weak supervision methods and model architectures. We demonstrate the benefits of weak supervision for low-resource NLU tasks and expect WALNUT to stimulate further research on methodologies that best leverage weak supervision. The benchmark and code for baselines will be publicly available at aka.ms/walnut_benchmark.

Self-Training with Weak Supervision

Apr 12, 2021
Giannis Karamanolakis, Subhabrata Mukherjee, Guoqing Zheng, Ahmed Hassan Awadallah

State-of-the-art deep neural networks require large-scale labeled training data that is often expensive to obtain or not available for many tasks. Weak supervision in the form of domain-specific rules has been shown to be useful in such settings to automatically generate weakly labeled training data. However, learning with weak rules is challenging due to their inherent heuristic and noisy nature. An additional challenge is rule coverage and overlap, where prior work on weak supervision only considers instances that are covered by weak rules, thus leaving valuable unlabeled data behind. In this work, we develop a weak supervision framework (ASTRA) that leverages all the available data for a given task. To this end, we leverage task-specific unlabeled data through self-training with a model (student) that considers contextualized representations and predicts pseudo-labels for instances that may not be covered by weak rules. We further develop a rule attention network (teacher) that learns how to aggregate student pseudo-labels with weak rule labels, conditioned on their fidelity and the underlying context of an instance. Finally, we construct a semi-supervised learning objective for end-to-end training with unlabeled data, domain-specific rules, and a small amount of labeled data. Extensive experiments on six benchmark datasets for text classification demonstrate the effectiveness of our approach with significant improvements over state-of-the-art baselines.
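
The coverage problem and the self-training remedy can be sketched with a toy example. The keyword "student" below stands in for ASTRA's neural student and attention-based teacher; all rules, documents, and labels are invented:

```python
# Toy weak supervision with self-training. Rules (keyword -> label)
# cover only some documents; a student fit on the teacher's
# pseudo-labels then labels the uncovered documents too.

def teacher_label(doc, rules):
    """Majority vote over fired rules; None if no rule covers the doc."""
    votes = [label for kw, label in rules if kw in doc]
    return max(set(votes), key=votes.count) if votes else None

def train_student(labeled):
    """Fit per-label token counts from (doc, label) pairs."""
    counts = {}
    for doc, label in labeled:
        for tok in doc.split():
            counts.setdefault(tok, {}).setdefault(label, 0)
            counts[tok][label] += 1
    return counts

def student_predict(doc, counts, default="neg"):
    """Score each label by the counts of the doc's tokens."""
    scores = {}
    for tok in doc.split():
        for label, c in counts.get(tok, {}).items():
            scores[label] = scores.get(label, 0) + c
    return max(scores, key=scores.get) if scores else default

rules = [("excellent", "pos"), ("terrible", "neg")]
docs = ["excellent food", "terrible service", "excellent value", "great food"]

# The teacher covers only 3 of 4 docs ("great food" fires no rule);
# the student generalizes to it through the co-occurring token "food".
covered = [(d, teacher_label(d, rules)) for d in docs
           if teacher_label(d, rules) is not None]
counts = train_student(covered)
```

ASTRA's actual teacher additionally weights each rule's vote by a learned, context-dependent fidelity; the sketch above only shows why the uncovered data is worth recovering.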

* Accepted to NAACL 2021 (Long Paper) 

Detecting Foodborne Illness Complaints in Multiple Languages Using English Annotations Only

Oct 11, 2020
Ziyi Liu, Giannis Karamanolakis, Daniel Hsu, Luis Gravano

Health departments have been deploying text classification systems for the early detection of foodborne illness complaints in social media documents such as Yelp restaurant reviews. Current systems have been successfully applied for documents in English and, as a result, a promising direction is to increase coverage and recall by considering documents in additional languages, such as Spanish or Chinese. Training previous systems for more languages, however, would be expensive, as it would require the manual annotation of many documents for each new target language. To address this challenge, we consider cross-lingual learning and train multilingual classifiers using only the annotations for English-language reviews. Recent zero-shot approaches based on pre-trained multilingual BERT (mBERT) have been shown to effectively align languages for aspects such as sentiment. Interestingly, we show that those approaches are less effective for capturing the nuances of foodborne illness, our public health application of interest. To improve performance without extra annotations, we create artificial training documents in the target language through machine translation and train mBERT jointly for the source (English) and target language. Furthermore, we show that translating labeled documents to multiple languages leads to additional performance improvements for some target languages. We demonstrate the benefits of our approach through extensive experiments with Yelp restaurant reviews in seven languages. Our classifiers identify foodborne illness complaints in multilingual reviews from the Yelp Challenge dataset, which highlights the potential of our general approach for deployment in health departments.
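
The translate-and-train step amounts to label-preserving augmentation. In the sketch below, the `toy_mt` lookup is a stand-in for a real machine translation system, and the example sentence is invented:

```python
# Label-preserving augmentation: each labeled English document gains a
# machine-translated copy with the same label, so the multilingual model
# sees the target language at training time. `toy_mt` is a hard-coded
# stand-in for a real MT system.

toy_mt = {"the soup made me sick": "la sopa me enfermo"}

def augment(labeled_docs, mt):
    out = list(labeled_docs)
    for doc, label in labeled_docs:
        if doc in mt:
            out.append((mt[doc], label))  # translated copy keeps the label
    return out

train_set = augment([("the soup made me sick", 1)], toy_mt)
```

The joint English-plus-translated training set would then be fed to a single mBERT fine-tuning run, rather than training one model per language.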

* Accepted for the 11th International Workshop on Health Text Mining and Information Analysis (LOUHI@EMNLP 2020) 

Cross-Lingual Text Classification with Minimal Resources by Transferring a Sparse Teacher

Oct 06, 2020
Giannis Karamanolakis, Daniel Hsu, Luis Gravano

Cross-lingual text classification alleviates the need for manually labeled documents in a target language by leveraging labeled documents from other languages. Existing approaches for transferring supervision across languages require expensive cross-lingual resources, such as parallel corpora, while less expensive cross-lingual representation learning approaches train classifiers without target labeled documents. In this work, we propose a cross-lingual teacher-student method, CLTS, that generates "weak" supervision in the target language using minimal cross-lingual resources, in the form of a small number of word translations. Given a limited translation budget, CLTS extracts and transfers only the most important task-specific seed words across languages and initializes a teacher classifier based on the translated seed words. Then, CLTS iteratively trains a more powerful student that also exploits the context of the seed words in unlabeled target documents and outperforms the teacher. CLTS is simple and surprisingly effective in 18 diverse languages: by transferring just 20 seed words, even a bag-of-words logistic regression student outperforms state-of-the-art cross-lingual methods (e.g., based on multilingual BERT). Moreover, CLTS can accommodate any type of student classifier: leveraging a monolingual BERT student leads to further improvements and outperforms even more expensive approaches by up to 12% in accuracy. Finally, CLTS addresses emerging tasks in low-resource languages using just a small number of word translations.
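
The budgeted seed-word transfer can be sketched as follows. The source-classifier weights, the small bilingual dictionary, and the Spanish target words are all invented for illustration:

```python
# Transfer the highest-magnitude source features within a translation
# budget, then use them to initialize a target-language teacher that
# abstains on documents containing no seed word.

source_weights = {"good": 2.1, "bad": -1.9, "great": 1.7,
                  "awful": -1.5, "movie": 0.1}
bilingual = {"good": "bueno", "bad": "malo",
             "great": "genial", "awful": "horrible"}

def transfer_teacher(weights, bilingual, budget):
    """Pick the `budget` most task-specific words and translate them."""
    seeds = sorted(weights, key=lambda w: -abs(weights[w]))[:budget]
    return {bilingual[w]: weights[w] for w in seeds if w in bilingual}

def teacher_predict(doc, target_weights):
    """Sum seed-word weights; abstain (None) on uncovered documents."""
    score = sum(target_weights.get(tok, 0.0) for tok in doc.split())
    if score == 0.0:
        return None
    return "pos" if score > 0 else "neg"

target_teacher = transfer_teacher(source_weights, bilingual, budget=3)
```

In CLTS, the student then learns from this teacher's labels on unlabeled target documents and picks up untranslated context words the teacher cannot see.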

* Accepted to Findings of EMNLP 2020 (Long Paper) 

AutoKnow: Self-Driving Knowledge Collection for Products of Thousands of Types

Jun 24, 2020
Xin Luna Dong, Xiang He, Andrey Kan, Xian Li, Yan Liang, Jun Ma, Yifan Ethan Xu, Chenwei Zhang, Tong Zhao, Gabriel Blanco Saldana, Saurabh Deshpande, Alexandre Michetti Manduca, Jay Ren, Surender Pal Singh, Fan Xiao, Haw-Shiuan Chang, Giannis Karamanolakis, Yuning Mao, Yaqing Wang, Christos Faloutsos, Andrew McCallum, Jiawei Han

Can one build a knowledge graph (KG) for all products in the world? Knowledge graphs have firmly established themselves as valuable sources of information for search and question answering, and it is natural to wonder if a KG can contain information about products offered at online retail sites. There have been several successful examples of generic KGs, but organizing information about products poses many additional challenges, including sparsity and noise of structured data for products, complexity of the domain with millions of product types and thousands of attributes, heterogeneity across a large number of categories, as well as a large and constantly growing number of products. We describe AutoKnow, our automatic (self-driving) system that addresses these challenges. The system includes a suite of novel techniques for taxonomy construction, product property identification, knowledge extraction, anomaly detection, and synonym discovery. AutoKnow is (a) automatic, requiring little human intervention, (b) multi-scalable, scalable in multiple dimensions (many domains, many products, and many attributes), and (c) integrative, exploiting rich customer behavior logs. AutoKnow has been operational in collecting product knowledge for over 11K product types.

* KDD 2020 

TXtract: Taxonomy-Aware Knowledge Extraction for Thousands of Product Categories

May 01, 2020
Giannis Karamanolakis, Jun Ma, Xin Luna Dong

Extracting structured knowledge from product profiles is crucial for various applications in e-Commerce. State-of-the-art approaches for knowledge extraction were each designed for a single category of product, and thus do not apply to real-life e-Commerce scenarios, which often contain thousands of diverse categories. This paper proposes TXtract, a taxonomy-aware knowledge extraction model that applies to thousands of product categories organized in a hierarchical taxonomy. Through category conditional self-attention and multi-task learning, our approach is both scalable, as it trains a single model for thousands of categories, and effective, as it extracts category-specific attribute values. Experiments on products from a taxonomy with 4,000 categories show that TXtract outperforms state-of-the-art approaches by up to 10% in F1 and 15% in coverage across all categories.
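
The category-conditioning idea can be illustrated with a toy attention layer whose token scores depend on a category embedding. The two-dimensional vectors and category names below are invented and do not reflect TXtract's actual architecture:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def category_attention(token_vecs, cat_vec):
    """Score each token against the category embedding, so the same
    sentence is attended to differently under different categories."""
    scores = [sum(t * c for t, c in zip(tok, cat_vec))
              for tok in token_vecs]
    return softmax(scores)

# Toy 2-d embeddings: token 0 is "flavor-like", token 1 is "scent-like".
tokens = [[1.0, 0.0], [0.0, 1.0]]
w_tea = category_attention(tokens, [1.0, 0.0])     # a made-up "Tea" category
w_candle = category_attention(tokens, [0.0, 1.0])  # a made-up "Candle" category
```

Because the category vector, not a separate per-category model, steers the attention, one set of parameters can serve thousands of categories while still producing category-specific extractions.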

* Accepted to ACL 2020 (Long Paper) 

Weakly Supervised Attention Networks for Fine-Grained Opinion Mining and Public Health

Sep 30, 2019
Giannis Karamanolakis, Daniel Hsu, Luis Gravano

In many review classification applications, a fine-grained analysis of the reviews is desirable, because different segments (e.g., sentences) of a review may focus on different aspects of the entity in question. However, training supervised models for segment-level classification requires segment labels, which may be more difficult or expensive to obtain than review labels. In this paper, we employ Multiple Instance Learning (MIL) and use only weak supervision in the form of a single label per review. First, we show that when inappropriate MIL aggregation functions are used, then MIL-based networks are outperformed by simpler baselines. Second, we propose a new aggregation function based on the sigmoid attention mechanism and show that our proposed model outperforms the state-of-the-art models for segment-level sentiment classification (by up to 9.8% in F1). Finally, we highlight the importance of fine-grained predictions in an important public-health application: finding actionable reports of foodborne illness. We show that our model achieves 48.6% higher recall compared to previous models, thus increasing the chance of identifying previously unknown foodborne outbreaks.
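
The contrast between aggregation functions can be sketched directly. This is a toy version of the sigmoid-attention idea (independent sigmoid gates rather than a softmax over segments), with invented segment scores:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mean_aggregate(scores):
    """Plain mean: a single relevant segment gets diluted."""
    return sum(scores) / len(scores)

def sigmoid_attention_aggregate(seg_scores, gate_logits):
    """Review-level score as a weighted mean of segment scores, where
    each weight is an independent sigmoid gate, so several segments can
    be up-weighted at once (unlike a softmax, which must compete)."""
    w = [sigmoid(g) for g in gate_logits]
    return sum(wi * s for wi, s in zip(w, seg_scores)) / sum(w)

# One strongly indicative segment among bland ones: the mean dilutes
# it, while the gated aggregate keeps the review-level score high.
seg = [0.9, 0.1, 0.1]
att = sigmoid_attention_aggregate(seg, [3.0, -3.0, -3.0])
```

In the MIL setting above, both the segment scores and the gate logits would come from the network; only the review-level label supervises them.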

* Accepted for the 5th Workshop on Noisy User-generated Text (W-NUT 2019), held in conjunction with EMNLP 2019 

Leveraging Just a Few Keywords for Fine-Grained Aspect Detection Through Weakly Supervised Co-Training

Sep 01, 2019
Giannis Karamanolakis, Daniel Hsu, Luis Gravano

User-generated reviews can be decomposed into fine-grained segments (e.g., sentences, clauses), each evaluating a different aspect of the principal entity (e.g., price, quality, appearance). Automatically detecting these aspects can be useful for both users and downstream opinion mining applications. Current supervised approaches for learning aspect classifiers require many fine-grained aspect labels, which are labor-intensive to obtain. And, unfortunately, unsupervised topic models often fail to capture the aspects of interest. In this work, we consider weakly supervised approaches for training aspect classifiers that only require the user to provide a small set of seed words (i.e., weakly positive indicators) for the aspects of interest. First, we show that current weakly supervised approaches do not effectively leverage the predictive power of seed words for aspect detection. Next, we propose a student-teacher approach that effectively leverages seed words in a bag-of-words classifier (teacher); in turn, we use the teacher to train a second model (student) that is potentially more powerful (e.g., a neural network that uses pre-trained word embeddings). Finally, we show that iterative co-training can be used to cope with noisy seed words, leading to both improved teacher and student models. Our proposed approach consistently outperforms previous weakly supervised approaches (by 14.1 absolute F1 points on average) in six different domains of product reviews and six multilingual datasets of restaurant reviews.

* Accepted to EMNLP 2019 

Item Recommendation with Variational Autoencoders and Heterogeneous Priors

Oct 07, 2018
Giannis Karamanolakis, Kevin Raji Cherian, Ananth Ravi Narayan, Jie Yuan, Da Tang, Tony Jebara

In recent years, Variational Autoencoders (VAEs) have been shown to be highly effective in both standard collaborative filtering applications and extensions such as the incorporation of implicit feedback. We extend VAEs to collaborative filtering with side information, for instance when ratings are combined with explicit text feedback from the user. Instead of using a user-agnostic standard Gaussian prior, we incorporate user-dependent priors in the latent VAE space to encode users' preferences as functions of the review text. Taking into account both the rating and the text information to represent users in this multimodal latent space is a promising way to improve recommendation quality. Our proposed model is shown to outperform existing VAE models for collaborative filtering (with up to a 29.41% relative improvement in a ranking metric), as well as other baselines that incorporate both user ratings and text for item recommendation.
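
The role of a user-dependent prior shows up directly in the VAE's KL term. Below is the standard closed-form Gaussian KL per latent dimension, with an invented user prior mean standing in for the text-derived prior:

```python
import math

def kl_gaussian(mu_q, sigma_q, mu_p, sigma_p=1.0):
    """KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ) for one latent
    dimension, in closed form."""
    return (math.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2 * sigma_p ** 2)
            - 0.5)

# Posterior for a user whose reviews push the encoding to mu_q = 1.5:
# a text-derived prior centered near that user costs far less KL than
# the user-agnostic standard Gaussian prior N(0, 1).
kl_standard = kl_gaussian(1.5, 1.0, 0.0)  # standard prior N(0, 1)
kl_user = kl_gaussian(1.5, 1.0, 1.4)      # made-up user prior mean of 1.4
```

With the user-dependent prior, the ELBO no longer penalizes user representations for drifting toward where the review text places them, which is the intuition behind encoding preferences in the prior rather than only in the posterior.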

* Accepted for the 3rd Workshop on Deep Learning for Recommender Systems (DLRS 2018), held in conjunction with the 12th ACM Conference on Recommender Systems (RecSys 2018) in Vancouver, Canada 