Weixin Liang

Improving Out-of-Distribution Robustness via Selective Augmentation

Jan 02, 2022
Huaxiu Yao, Yu Wang, Sai Li, Linjun Zhang, Weixin Liang, James Zou, Chelsea Finn

Machine learning algorithms typically assume that training and test examples are drawn from the same distribution. However, distribution shift is a common problem in real-world applications and can cause models to perform dramatically worse at test time. In this paper, we specifically consider the problems of domain shifts and subpopulation shifts (e.g., imbalanced data). While prior works often seek to explicitly regularize internal representations and predictors of the model to be domain invariant, we instead aim to regularize the whole function without restricting the model's internal representations. This leads to LISA, a simple mixup-based technique that learns invariant functions via selective augmentation. LISA selectively interpolates samples either with the same labels but different domains or with the same domain but different labels. We analyze a linear setting and theoretically show how LISA leads to a smaller worst-group error. Empirically, we study the effectiveness of LISA on nine benchmarks ranging from subpopulation shifts to domain shifts, and we find that LISA consistently outperforms other state-of-the-art methods.
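To make the selection rule concrete, here is a minimal NumPy sketch of selective mixup, assuming arrays `x` (features), `y` (numeric labels), and `d` (domain ids); the function name, the fallback partner choice, and the Beta(2, 2) interpolation strength are illustrative assumptions rather than the paper's exact configuration.

```python
import numpy as np

def selective_mixup(x, y, d, strategy="intra_label", alpha=2.0, rng=None):
    """Interpolate each sample with a partner chosen by the LISA rule.

    strategy="intra_label": same label, different domain.
    strategy="intra_domain": same domain, different label.
    """
    rng = rng or np.random.default_rng()
    n = len(x)
    partners = np.empty(n, dtype=int)
    for i in range(n):
        if strategy == "intra_label":
            mask = (y == y[i]) & (d != d[i])
        else:  # intra_domain
            mask = (d == d[i]) & (y != y[i])
        candidates = np.flatnonzero(mask)
        # Fall back to any sample if no partner satisfies the constraint.
        partners[i] = rng.choice(candidates) if len(candidates) else rng.integers(n)
    lam = rng.beta(alpha, alpha, size=(n, 1))
    x_mix = lam * x + (1.0 - lam) * x[partners]
    y_mix = lam[:, 0] * y + (1.0 - lam[:, 0]) * y[partners]  # soft labels when labels differ
    return x_mix, y_mix
```

Under the intra-label strategy the mixed labels stay unchanged, so only the inputs are interpolated across domains; under the intra-domain strategy the labels themselves are softened.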

Disparities in Dermatology AI: Assessments Using Diverse Clinical Images

Nov 15, 2021
Roxana Daneshjou, Kailas Vodrahalli, Weixin Liang, Roberto A Novoa, Melissa Jenkins, Veronica Rotemberg, Justin Ko, Susan M Swetter, Elizabeth E Bailey, Olivier Gevaert, Pritam Mukherjee, Michelle Phung, Kiana Yekrang, Bradley Fong, Rachna Sahasrabudhe, James Zou, Albert Chiou

More than 3 billion people lack access to care for skin disease. AI diagnostic tools may aid in early skin cancer detection; however, most models have not been assessed on images of diverse skin tones or uncommon diseases. To address this, we curated the Diverse Dermatology Images (DDI) dataset - the first publicly available, pathologically confirmed image dataset featuring diverse skin tones. We show that state-of-the-art dermatology AI models perform substantially worse on DDI, with ROC-AUC dropping 29-40 percent compared to the models' original results. We find that dark skin tones and uncommon diseases, which are well represented in the DDI dataset, lead to performance drop-offs. Additionally, we show that state-of-the-art robust training methods cannot correct for these biases without diverse training data. Our findings identify important weaknesses and biases in dermatology AI that need to be addressed to ensure reliable application to diverse patients and across all diseases.

* Machine Learning for Health (ML4H) - Extended Abstract 

HERALD: An Annotation Efficient Method to Detect User Disengagement in Social Conversations

Jun 02, 2021
Weixin Liang, Kai-Hui Liang, Zhou Yu

Open-domain dialog systems have a user-centric goal: to provide humans with an engaging conversation experience. User engagement is one of the most important metrics for evaluating open-domain dialog systems, and could also be used as real-time feedback to benefit dialog policy learning. Existing work on detecting user disengagement typically requires hand-labeling many dialog samples. We propose HERALD, an efficient annotation framework that reframes the training data annotation process as a denoising problem. Specifically, instead of manually labeling training samples, we first use a set of labeling heuristics to label training samples automatically. We then denoise the weakly labeled data using the Shapley algorithm. Finally, we use the denoised data to train a user engagement detector. Our experiments show that HERALD improves annotation efficiency significantly and achieves 86% user disengagement detection accuracy in two dialog corpora.
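The two stages can be sketched roughly as follows, with hypothetical regex heuristics standing in for the disengagement cues and an exact KNN data-Shapley scorer (Jia et al., 2019) computed against a small hand-labeled dev set; the dialog encoder that would produce the feature vectors is omitted, and the cue list and K are illustrative assumptions.

```python
import re
import numpy as np

# Hypothetical weak-labeling heuristics for user disengagement.
DISENGAGEMENT_CUES = [r"\bstop\b", r"\bshut up\b", r"\bboring\b", r"\bbye\b"]

def heuristic_label(utterance: str) -> int:
    """1 = disengaged, 0 = engaged, by simple pattern matching."""
    return int(any(re.search(p, utterance.lower()) for p in DISENGAGEMENT_CUES))

def knn_shapley(train_X, train_y, dev_X, dev_y, k=5):
    """Exact KNN data-Shapley values, averaged over a small clean dev set."""
    n = len(train_X)
    values = np.zeros(n)
    for x_t, y_t in zip(dev_X, dev_y):
        order = np.argsort(np.linalg.norm(train_X - x_t, axis=1))  # nearest first
        match = (train_y[order] == y_t).astype(float)
        s = np.zeros(n)
        s[n - 1] = match[n - 1] / n
        for i in range(n - 2, -1, -1):  # recursive formula, farthest to nearest
            s[i] = s[i + 1] + (match[i] - match[i + 1]) / k * min(k, i + 1) / (i + 1)
        values[order] += s / len(dev_X)
    return values

# Training points with low (e.g. negative) Shapley value are likely mislabeled
# by the heuristics and can be dropped or relabeled before training the detector.
```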

* ACL 2021. Code & data available at https://github.com/Weixin-Liang/HERALD/ 

GraghVQA: Language-Guided Graph Neural Networks for Graph-based Visual Question Answering

Apr 20, 2021
Weixin Liang, Yanhao Jiang, Zixuan Liu

Images are more than a collection of objects or attributes -- they represent a web of relationships among interconnected objects. The scene graph has emerged as a new modality: a structured graphical representation of an image that encodes objects as nodes connected via pairwise relations as edges. To support question answering on scene graphs, we propose GraphVQA, a language-guided graph neural network framework that translates and executes a natural language question as multiple iterations of message passing among graph nodes. We explore the design space of the GraphVQA framework and discuss the trade-offs of different design choices. Our experiments on the GQA dataset show that GraphVQA outperforms the state of the art by a large margin, improving accuracy from 88.43% to 94.78%.
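The core idea of language-guided message passing can be illustrated with a toy NumPy sketch; the dense tensors, sigmoid edge gating, and tanh update below are assumptions made for readability, not the graph neural network layers used in the released implementation.

```python
import numpy as np

def guided_message_passing(node_feats, adj, edge_feats, instructions):
    """node_feats: (N, D); adj: (N, N) 0/1; edge_feats: (N, N, D); instructions: (T, D)."""
    h = node_feats.copy()
    for q in instructions:                                    # one instruction per hop
        # Gate each edge by its relevance to the current instruction vector.
        gate = 1.0 / (1.0 + np.exp(-(edge_feats @ q)))        # (N, N)
        messages = (adj * gate)[:, :, None] * h[None, :, :]   # message from j to i
        h = np.tanh(h + messages.sum(axis=1))                 # aggregate neighbors, update
    return h
```

Each iteration conditions the graph update on a different instruction vector derived from the question, so multi-hop questions are answered by chaining several rounds of gated message passing.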

* NAACL 2021 MAI-Workshop. Code available at https://github.com/codexxxl/GraphVQA 

LRTA: A Transparent Neural-Symbolic Reasoning Framework with Modular Supervision for Visual Question Answering

Nov 21, 2020
Weixin Liang, Feiyang Niu, Aishwarya Reganti, Govind Thattai, Gokhan Tur

The predominant approach to visual question answering (VQA) relies on encoding the image and question with a "black-box" neural encoder and decoding a single token as the answer, like "yes" or "no". Despite this approach's strong quantitative results, it struggles to produce intuitive, human-readable justifications of the prediction process. To address this shortcoming, we reformulate VQA as a full answer generation task, which requires the model to justify its predictions in natural language. We propose LRTA [Look, Read, Think, Answer], a transparent neural-symbolic reasoning framework for visual question answering that solves the problem step by step like humans and provides a human-readable justification at each step. Specifically, LRTA learns to first convert an image into a scene graph and parse a question into multiple reasoning instructions. It then executes the reasoning instructions one at a time by traversing the scene graph using a recurrent neural-symbolic execution module. Finally, it generates a full answer to the given question with natural language justifications. Our experiments on the GQA dataset show that LRTA outperforms the state-of-the-art model by a large margin (43.1% vs. 28.0%) on the full answer generation task. We also create a perturbed GQA test set by removing linguistic cues (attributes and relations) from the questions to analyze whether a model is merely making smart guesses from superficial data correlations. We show that LRTA takes a step towards truly understanding the question, while the state-of-the-art model tends to learn superficial correlations from the training data.
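The execution stage can be pictured as a small symbolic interpreter over a scene graph; the instruction vocabulary (`select`, `filter`, `relate`, `query`) and the dictionary-based graph format below are illustrative assumptions, not the actual recurrent neural-symbolic module.

```python
# Toy scene graph: a red cup on a wooden table.
scene_graph = {
    "nodes": {0: {"name": "cup", "attrs": {"red"}},
              1: {"name": "table", "attrs": {"wooden"}}},
    "edges": [(0, "on", 1)],
}

def execute(graph, instructions):
    active = set(graph["nodes"])                     # start from all objects
    for op, arg in instructions:
        if op == "select":                           # keep objects by name
            active = {n for n in active if graph["nodes"][n]["name"] == arg}
        elif op == "filter":                         # keep objects by attribute
            active = {n for n in active if arg in graph["nodes"][n]["attrs"]}
        elif op == "relate":                         # hop along a relation edge
            active = {t for s, r, t in graph["edges"] if s in active and r == arg}
        elif op == "query":                          # read out a property
            return [graph["nodes"][n][arg] for n in active]
    return active

# "What is the red cup on?" -> select cup, filter red, relate on, query name
print(execute(scene_graph, [("select", "cup"), ("filter", "red"),
                            ("relate", "on"), ("query", "name")]))   # ['table']
```

Because every intermediate set of active nodes is inspectable, each step of the traversal doubles as a human-readable justification.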

* NeurIPS KR2ML 2020 

Neural Group Testing to Accelerate Deep Learning

Nov 21, 2020
Weixin Liang, James Zou

Recent advances in deep learning have led to the use of large, deep neural networks with tens of millions of parameters. The sheer size of these networks imposes a challenging computational burden during inference. Existing work focuses primarily on accelerating each forward pass of a neural network. Inspired by the group testing strategy for efficient disease testing, we propose neural group testing, which accelerates inference by testing a group of samples in one forward pass. Groups of samples that test negative are ruled out. If a group tests positive, samples in that group are then retested adaptively. A key challenge of neural group testing is to modify a deep neural network so that it can test multiple samples in one forward pass. We propose three designs to achieve this without introducing any new parameters and evaluate their performance. We applied neural group testing to an image moderation task to detect rare but inappropriate images. We found that neural group testing can group up to 16 images in one forward pass and reduce the overall computation cost by over 73% while improving detection performance.
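The adaptive scheme can be sketched as follows; `model_predict_group` is a hypothetical stand-in for a network modified to score a merged group of images in a single forward pass, and the two-round (group, then individual) retesting is one simple instance of the adaptive strategy.

```python
def neural_group_testing(images, model_predict_group, group_size=16):
    """Return indices of positive images using group-level forward passes."""
    positives = []
    for start in range(0, len(images), group_size):
        group = images[start:start + group_size]
        if not model_predict_group(group):      # one forward pass for the whole group
            continue                            # group tests negative: all members ruled out
        # Group tests positive: retest its members individually (or in sub-groups).
        for idx, img in enumerate(group, start=start):
            if model_predict_group([img]):
                positives.append(idx)
    return positives
```

When positives are rare, most groups test negative after a single pass, which is where the bulk of the computational savings comes from.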

ALICE: Active Learning with Contrastive Natural Language Explanations

Sep 22, 2020
Weixin Liang, James Zou, Zhou Yu

Training a supervised neural network classifier typically requires many annotated training samples. Collecting and annotating a large number of data points is costly and sometimes even infeasible. The traditional annotation process uses a low-bandwidth human-machine communication interface: classification labels, each of which only provides a few bits of information. We propose Active Learning with Contrastive Explanations (ALICE), an expert-in-the-loop training framework that utilizes contrastive natural language explanations to improve data efficiency in learning. ALICE first uses active learning to select the most informative pairs of label classes and elicits contrastive natural language explanations from experts. It then extracts knowledge from these explanations using a semantic parser. Finally, it incorporates the extracted knowledge by dynamically changing the learning model's structure. We applied ALICE to two visual recognition tasks, bird species classification and social relationship classification. We found that by incorporating contrastive explanations, our models outperform baseline models trained with 40-100% more training data. We also found that adding one explanation yields a performance gain similar to adding 13-30 labeled training data points.
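The class-pair selection step can be sketched as picking the pair of classes the current model confuses most on a validation set; the confusion-count criterion below is an illustrative proxy for the paper's selection strategy, and `val_labels`/`val_preds` are assumed to be integer class ids.

```python
from itertools import combinations
import numpy as np

def most_confused_pair(val_labels, val_preds, num_classes):
    """Return the class pair with the most cross-class mistakes on the validation set."""
    confusion = np.zeros((num_classes, num_classes))
    for y, p in zip(val_labels, val_preds):
        confusion[y, p] += 1
    best_pair, best_score = None, -1.0
    for a, b in combinations(range(num_classes), 2):
        score = confusion[a, b] + confusion[b, a]   # mistakes in both directions
        if score > best_score:
            best_pair, best_score = (a, b), score
    return best_pair   # elicit a contrastive explanation distinguishing these two classes
```

The expert's explanation for the returned pair (e.g., "class A has a red crown while class B does not") is then parsed and folded back into the model's structure.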

* EMNLP 2020  

Beyond User Self-Reported Likert Scale Ratings: A Comparison Model for Automatic Dialog Evaluation

Jun 12, 2020
Weixin Liang, James Zou, Zhou Yu

Open-domain dialog system evaluation is one of the most important challenges in dialog research. Existing automatic evaluation metrics, such as BLEU, are mostly reference-based: they calculate the difference between the generated response and a limited number of available references. Likert-scale self-reported user ratings are widely adopted by social conversational systems, such as Amazon Alexa Prize chatbots. However, self-reported ratings suffer from bias and variance across users. To alleviate this problem, we formulate dialog evaluation as a comparison task. We also propose CMADE (Comparison Model for Automatic Dialog Evaluation), an automatic evaluation model that cleans self-reported user ratings as it trains on them. Specifically, we first use a self-supervised method to learn better dialog feature representations, and then use KNN and Shapley to remove confusing samples. Our experiments show that CMADE achieves 89.2% accuracy in the dialog comparison task.
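The comparison formulation can be sketched as learning a scorer whose score differences predict which of two dialogs received the higher rating; the linear scorer, fixed features, and SGD loop below are illustrative stand-ins for the learned dialog representation and training procedure in the paper.

```python
import numpy as np

def train_comparison_model(feats, ratings, epochs=200, lr=0.1):
    """feats: (N, D) dialog features; ratings: (N,) noisy self-reported scores."""
    rng = np.random.default_rng(0)
    w = np.zeros(feats.shape[1])
    # Build preference pairs: dialog i was rated higher than dialog j.
    pairs = [(i, j) for i in range(len(feats)) for j in range(len(feats))
             if ratings[i] > ratings[j]]
    for _ in range(epochs):
        i, j = pairs[rng.integers(len(pairs))]
        p = 1.0 / (1.0 + np.exp(-(feats[i] - feats[j]) @ w))   # P(i beats j)
        w += lr * (1 - p) * (feats[i] - feats[j])              # gradient step on the pairwise loss
    return w                                                   # score(x) = x @ w
```

Training on relative comparisons rather than raw Likert scores makes the model less sensitive to the per-user rating bias described above.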

* ACL 2020  