Xiaofei Ma

Self-consistency for open-ended generations

Jul 11, 2023
Siddhartha Jain, Xiaofei Ma, Anoop Deoras, Bing Xiang

In this paper, we present a novel approach for improving the quality and consistency of generated outputs from large-scale pre-trained language models (LLMs). Self-consistency has emerged as an effective approach for prompts with fixed answers, selecting the answer that receives the most votes. We introduce a generalized framework for self-consistency that extends its applicability beyond problems with fixed answers. Through extensive simulations, we demonstrate that our approach consistently recovers the optimal or near-optimal generation from a set of candidates. We also propose lightweight, parameter-free similarity functions that yield significant and consistent improvements across code generation, autoformalization, and summarization tasks, even without access to token log probabilities. Our method incurs minimal computational overhead, requiring no auxiliary reranker models or modifications to the existing model.
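
As a concrete illustration, here is a minimal sketch of similarity-based self-consistency: sample several generations, score each by its average similarity to the others, and return the most "central" one. The n-gram Jaccard similarity below is a generic stand-in for the paper's lightweight parameter-free similarity functions, and the candidate list is a toy example.

```python
def ngram_set(text: str, n: int = 1) -> set:
    """Whitespace-tokenize and return the set of token n-grams."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a: str, b: str, n: int = 1) -> float:
    """Jaccard similarity between the n-gram sets of two generations."""
    sa, sb = ngram_set(a, n), ngram_set(b, n)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def select_by_consistency(candidates: list) -> str:
    """Return the candidate with the highest mean similarity to the rest."""
    best, best_score = candidates[0], -1.0
    for i, cand in enumerate(candidates):
        others = [c for j, c in enumerate(candidates) if j != i]
        score = sum(jaccard(cand, o) for o in others) / max(len(others), 1)
        if score > best_score:
            best, best_score = cand, score
    return best

# Toy usage: the first two generations agree, so one of them is chosen.
samples = [
    "def add(a, b): return a + b",
    "def add(a, b): return a + b",
    "def add(a, b): return a - b",
]
print(select_by_consistency(samples))
```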

Exploring Continual Learning for Code Generation Models

Jul 05, 2023
Prateek Yadav, Qing Sun, Hantian Ding, Xiaopeng Li, Dejiao Zhang, Ming Tan, Xiaofei Ma, Parminder Bhatia, Ramesh Nallapati, Murali Krishna Ramanathan, Mohit Bansal, Bing Xiang

Large-scale code generation models such as Codex and CodeT5 have achieved impressive performance. However, libraries are upgraded or deprecated frequently, and re-training large-scale language models is computationally expensive. Therefore, Continual Learning (CL) is an important aspect that remains underexplored in the code domain. In this paper, we introduce a benchmark called CodeTask-CL that covers a wide range of tasks, including code generation, translation, summarization, and refinement, with different input and output programming languages. On this benchmark, we compare popular CL techniques from the NLP and vision domains. We find that an otherwise effective method like Prompt Pooling (PP) suffers from catastrophic forgetting because the stark distribution shifts across coding tasks destabilize training of the prompt selection mechanism. We address this issue with our proposed method, Prompt Pooling with Teacher Forcing (PP-TF), which stabilizes training by enforcing constraints on the prompt selection mechanism and leads to a 21.54% improvement over Prompt Pooling. Along with the benchmark, we establish a training pipeline for CL on code models, which we believe can motivate further development of CL methods for code models. Our code is available at https://github.com/amazon-science/codetaskcl-pptf

* ACL 2023 
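
To make the PP-TF idea concrete, here is a minimal PyTorch sketch of a prompt pool whose selection is "teacher forced" during training: each task is restricted to a fixed subset of prompts, while inference falls back to query-key matching. The pool sizes and the task-to-prompt assignment are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class TeacherForcedPromptPool(nn.Module):
    def __init__(self, pool_size=20, prompt_len=8, dim=768, top_k=4):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(pool_size, prompt_len, dim))
        self.keys = nn.Parameter(torch.randn(pool_size, dim))
        self.top_k = top_k
        # Hypothetical fixed task -> prompt assignment used as the "teacher".
        self.task_to_prompts = {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}

    def forward(self, query, task_id=None):
        # query: (dim,) pooled representation of the input instance.
        if task_id is not None:
            # Teacher forcing: constrain selection to the task's own prompts,
            # shielding the matching mechanism from cross-task distribution shift.
            idx = torch.tensor(self.task_to_prompts[task_id])
        else:
            # Inference: ordinary query-key matching over the whole pool.
            idx = (self.keys @ query).topk(self.top_k).indices
        # Concatenate the selected prompts: (top_k * prompt_len, dim).
        return self.prompts[idx].reshape(-1, self.prompts.shape[-1])

pool = TeacherForcedPromptPool()
query = torch.randn(768)
print(pool(query, task_id=0).shape)  # torch.Size([32, 768]) during training
print(pool(query).shape)             # torch.Size([32, 768]) at inference
```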

Efficient Shapley Values Estimation by Amortization for Text Classification

May 31, 2023
Chenghao Yang, Fan Yin, He He, Kai-Wei Chang, Xiaofei Ma, Bing Xiang

Despite the popularity of Shapley Values for explaining neural text classification models, computing them exactly is prohibitive for large pretrained models because of the large number of model evaluations required. In practice, Shapley Values are therefore estimated from a small number of stochastic model evaluations. However, we show that such estimates are sensitive to the choice of random seed: the top-ranked features often have little overlap across seeds, especially on examples with longer input texts. This instability can only be mitigated by aggregating thousands of model evaluations, which in turn induces substantial computational overhead. To escape this trade-off between stability and efficiency, we develop an amortized model that directly predicts each input feature's Shapley Value without additional model evaluations. It is trained on a set of examples whose Shapley Values are estimated from a large number of model evaluations to ensure stability. Experimental results on two text classification datasets demonstrate that our amortized model estimates Shapley Values accurately, with up to a 60-fold speedup over traditional methods. Furthermore, the estimated values are stable because the inference is deterministic. We release our code at https://github.com/yangalan123/Amortized-Interpretability.

* ACL 2023 Camera Ready 
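
A minimal sketch of the amortization idea, assuming a frozen encoder provides token embeddings and that stable Shapley targets have been precomputed offline from many model evaluations; the linear head and MSE objective are illustrative choices, not necessarily the paper's exact architecture.

```python
import torch
import torch.nn as nn

class AmortizedExplainer(nn.Module):
    """Predict one Shapley value per token from contextual embeddings."""
    def __init__(self, dim=768):
        super().__init__()
        self.head = nn.Linear(dim, 1)

    def forward(self, token_embeddings):
        # token_embeddings: (batch, seq_len, dim) from a frozen encoder.
        return self.head(token_embeddings).squeeze(-1)  # (batch, seq_len)

explainer = AmortizedExplainer()
optimizer = torch.optim.Adam(explainer.parameters(), lr=1e-4)

# Stand-ins for one batch: encoder outputs and Shapley targets estimated
# offline with a large number of model evaluations (for stability).
embeddings = torch.randn(2, 16, 768)
target_shapley = torch.randn(2, 16)

loss = nn.functional.mse_loss(explainer(embeddings), target_shapley)
loss.backward()
optimizer.step()
# At inference time a single deterministic forward pass replaces thousands
# of stochastic model evaluations.
```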

STREET: A Multi-Task Structured Reasoning and Explanation Benchmark

Feb 13, 2023
Danilo Ribeiro, Shen Wang, Xiaofei Ma, Henry Zhu, Rui Dong, Deguang Kong, Juliette Burger, Anjelica Ramos, William Wang, Zhiheng Huang, George Karypis, Bing Xiang, Dan Roth

We introduce STREET, a unified multi-task and multi-domain natural language reasoning and explanation benchmark. Unlike most existing question-answering (QA) datasets, we expect models not only to answer questions but also to produce step-by-step structured explanations describing how premises in the question are used to derive intermediate conclusions that prove the correctness of a given answer. We perform extensive evaluation with popular language models, including few-shot prompted GPT-3 and fine-tuned T5. We find that these models still lag behind human performance when producing such structured reasoning steps. We believe this work will provide a way for the community to better train and test systems on multi-step reasoning and explanations in natural language.

* Published in ICLR 2023 
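
For intuition, the snippet below shows the kind of structured, step-by-step explanation the benchmark asks for, expressed as a small Python dictionary; this is an invented illustration of the format, not the dataset's actual schema or an actual STREET example.

```python
example = {
    "question": "Tom has 3 apples and Ann has 5. Is the total number even?",
    "answer": "yes",
    "reasoning_steps": [
        {"premises": ["Tom has 3 apples", "Ann has 5 apples"],
         "conclusion": "Together they have 8 apples"},
        {"premises": ["Together they have 8 apples"],
         "conclusion": "8 is divisible by 2, so the total is even"},
    ],
}
```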

SWING: Balancing Coverage and Faithfulness for Dialogue Summarization

Jan 25, 2023
Kung-Hsiang Huang, Siffi Singh, Xiaofei Ma, Wei Xiao, Feng Nan, Nicholas Dingwall, William Yang Wang, Kathleen McKeown

Missing information is a common issue in dialogue summarization, where some information in the reference summaries is not covered by the generated summaries. To address this issue, we propose to utilize natural language inference (NLI) models to improve coverage while avoiding the introduction of factual inconsistencies. Specifically, we use NLI to compute fine-grained training signals that encourage the model to generate content from the reference summaries that has not yet been covered, as well as to distinguish between factually consistent and inconsistent generated sentences. Experiments on the DialogSum and SAMSum datasets confirm the effectiveness of the proposed approach in balancing coverage and faithfulness, validated with automatic metrics and human evaluations. Additionally, we compute the correlation between commonly used automatic metrics and human judgments along three dimensions of coverage and factual consistency, to provide insight into the most suitable metrics for evaluating dialogue summaries.

* Accepted by Findings of EACL 2023 
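
A minimal sketch of the NLI-as-signal idea, using an off-the-shelf MNLI model to score whether a summary sentence is entailed by the dialogue; the model choice and the way such scores would be folded into training are assumptions, not the paper's exact recipe.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(name)
nli = AutoModelForSequenceClassification.from_pretrained(name).eval()

def entailment_prob(premise: str, hypothesis: str) -> float:
    """Probability that the premise entails the hypothesis."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = nli(**inputs).logits  # (1, 3): contradiction / neutral / entailment
    return logits.softmax(-1)[0, 2].item()

dialogue = "A: I'll book the 9am flight. B: Great, I'll reserve the hotel."
sentence = "They agreed to book a morning flight and a hotel."
# A low score flags a likely unfaithful generated sentence; conversely, a
# reference sentence not entailed by the generated summary marks missing content.
print(f"entailment probability: {entailment_prob(dialogue, sentence):.2f}")
```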

ContraGen: Effective Contrastive Learning For Causal Language Model

Oct 03, 2022
Nihal Jain, Dejiao Zhang, Wasi Uddin Ahmad, Zijian Wang, Feng Nan, Xiaopeng Li, Ming Tan, Ramesh Nallapati, Baishakhi Ray, Parminder Bhatia, Xiaofei Ma, Bing Xiang

Despite exciting progress in large-scale language generation, the expressiveness of the learned representations is severely limited by the anisotropy issue, in which hidden representations collapse into a narrow cone in the vector space. To address this issue, we present ContraGen, a novel contrastive learning framework that improves the representations' uniformity and discrimination. We assess ContraGen on a wide range of downstream tasks in natural and programming languages. We show that ContraGen effectively enhances both the uniformity and the discrimination of the representations, leading to the desired improvements on various language understanding tasks where discriminative representations are crucial for good performance. Specifically, we attain a 44% relative improvement on Semantic Textual Similarity tasks and 34% on Code-to-Code Search tasks. Furthermore, by improving the expressiveness of the representations, ContraGen also boosts source code generation, with a 9% relative improvement in execution accuracy on the HumanEval benchmark.

* 10 pages 
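
To illustrate the general recipe, here is a minimal InfoNCE-style contrastive term added to a causal LM loss; treating two stochastic (dropout) forward passes of the same sequence as a positive pair is an assumption for this sketch, not necessarily ContraGen's exact pair construction or temperature.

```python
import torch
import torch.nn.functional as F

def info_nce(h1, h2, tau=0.05):
    """Contrastive loss: matching rows of h1/h2 are positives, the rest negatives."""
    z1, z2 = F.normalize(h1, dim=-1), F.normalize(h2, dim=-1)
    logits = z1 @ z2.t() / tau              # (batch, batch) similarity matrix
    labels = torch.arange(z1.size(0))       # positives sit on the diagonal
    return F.cross_entropy(logits, labels)

# Pooled hidden states of the same batch under two stochastic forward passes.
h_a, h_b = torch.randn(8, 768), torch.randn(8, 768)
lm_loss = torch.tensor(2.3)                 # stand-in for the next-token loss
total_loss = lm_loss + 0.1 * info_nce(h_a, h_b)
# Spreading representations apart (uniformity) while pulling positives
# together (discrimination) counteracts the anisotropic narrow cone.
```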

Learning Dialogue Representations from Consecutive Utterances

May 26, 2022
Zhihan Zhou, Dejiao Zhang, Wei Xiao, Nicholas Dingwall, Xiaofei Ma, Andrew O. Arnold, Bing Xiang

Learning high-quality dialogue representations is essential for solving a variety of dialogue-oriented tasks, especially considering that dialogue systems often suffer from data scarcity. In this paper, we introduce Dialogue Sentence Embedding (DSE), a self-supervised contrastive learning method that learns effective dialogue representations suitable for a wide range of dialogue tasks. DSE learns from dialogues by taking consecutive utterances of the same dialogue as positive pairs for contrastive learning. Despite its simplicity, DSE achieves significantly better representation capability than other dialogue representation and universal sentence representation models. We evaluate DSE on five downstream dialogue tasks that examine dialogue representation at different semantic granularities. Experiments in few-shot and zero-shot settings show that DSE outperforms baselines by a large margin; for example, it achieves a 13% average performance improvement over the strongest unsupervised baseline in 1-shot intent classification across 6 datasets. We also provide analyses of the benefits and limitations of our model.

* NAACL 2022 main conference 
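
The positive-pair construction at the heart of DSE is simple enough to sketch directly. The pairing below follows the abstract above; the downstream contrastive objective (e.g., InfoNCE, as sketched for ContraGen above) is left out.

```python
def consecutive_pairs(dialogues):
    """Turn each dialogue (a list of utterances) into (u_t, u_{t+1}) positive pairs."""
    pairs = []
    for utterances in dialogues:
        for current, nxt in zip(utterances, utterances[1:]):
            pairs.append((current, nxt))
    return pairs

dialogues = [
    ["Hi, I'd like to change my flight.",
     "Sure, what is your booking reference?",
     "It's ABC123."],
]
for anchor, positive in consecutive_pairs(dialogues):
    print(anchor, "->", positive)
```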

Debiasing Neural Retrieval via In-batch Balancing Regularization

May 18, 2022
Yuantong Li, Xiaokai Wei, Zijian Wang, Shen Wang, Parminder Bhatia, Xiaofei Ma, Andrew Arnold

People frequently interact with information retrieval (IR) systems; however, IR models exhibit biases and discrimination towards various demographic groups. In-processing fair-ranking methods trade off accuracy against fairness by adding a fairness-related regularization term to the loss function. However, there has been no intuitive objective function that depends on click probability and user engagement and can be optimized toward fairness directly. In this work, we propose In-Batch Balancing Regularization (IBBR) to mitigate ranking disparity among subgroups. In particular, we develop a differentiable normed Pairwise Ranking Fairness (nPRF) measure and leverage the T-statistic of nPRF over subgroups as a regularizer to improve fairness. Empirical results with BERT-based neural rankers on the MS MARCO Passage Retrieval dataset with the human-annotated non-gendered queries benchmark (Rekabsaz and Schedl, 2020) show that our IBBR method with nPRF achieves significantly less bias with minimal degradation in ranking performance compared with the baseline.

* 9 pages, 1 figure, and 3 tables. A version appears in the Proceedings of the 4th Workshop on Gender Bias in Natural Language Processing (GeBNLP), 2022 
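
A minimal sketch of an in-batch balancing regularizer: compute per-document scores for two subgroups within a batch and penalize the squared two-sample T-statistic between them. The raw ranker scores stand in for the paper's normed Pairwise Ranking Fairness values, and the loss weighting is an arbitrary illustrative choice.

```python
import torch

def t_statistic(a, b, eps=1e-8):
    """Welch's two-sample T-statistic between score groups a and b."""
    denom = torch.sqrt(a.var() / a.numel() + b.var() / b.numel() + eps)
    return (a.mean() - b.mean()) / denom

# Relevance scores from the ranker and subgroup ids for one batch.
scores = torch.randn(32, requires_grad=True)
groups = torch.arange(32) % 2          # two protected subgroups, alternating

ranking_loss = -scores.sigmoid().log().mean()  # stand-in for the usual IR loss
fairness_penalty = t_statistic(scores[groups == 0], scores[groups == 1]) ** 2
loss = ranking_loss + 0.1 * fairness_penalty   # balanced scores drive the penalty to zero
loss.backward()
```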

Entailment Tree Explanations via Iterative Retrieval-Generation Reasoner

May 18, 2022
Danilo Ribeiro, Shen Wang, Xiaofei Ma, Rui Dong, Xiaokai Wei, Henry Zhu, Xinchi Chen, Zhiheng Huang, Peng Xu, Andrew Arnold, Dan Roth

Large language models have achieved high performance on various question answering (QA) benchmarks, but the explainability of their output remains elusive. Structured explanations, called entailment trees, were recently proposed as a way to explain and inspect a QA system's answer. In order to better generate such entailment trees, we propose an architecture called the Iterative Retrieval-Generation Reasoner (IRGR). Our model explains a given hypothesis by systematically generating a step-by-step explanation from textual premises. The IRGR model iteratively searches for suitable premises, constructing a single entailment step at a time. Contrary to previous approaches, our method interleaves generation steps with premise retrieval, allowing the model to leverage intermediate conclusions and mitigating the input-size limit of baseline encoder-decoder models. We conduct experiments on the EntailmentBank dataset, where we outperform existing baselines on premise retrieval and entailment tree generation, with around a 300% gain in overall correctness.

* published in NAACL 2022 
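
A minimal skeleton of the iterative retrieve-then-generate loop described above; `retrieve` and `generate_step` are hypothetical stand-ins for the paper's retriever and seq2seq generator, and the stopping rule is a simplification.

```python
def build_entailment_tree(hypothesis, retrieve, generate_step, max_steps=5):
    known = []   # premises and intermediate conclusions gathered so far
    steps = []   # (premises_used, conclusion) entailment steps
    for _ in range(max_steps):
        # Retrieve premises relevant to what still needs to be proven.
        premises = retrieve(hypothesis, known)
        # Generate exactly one entailment step, reusing earlier conclusions.
        conclusion = generate_step(premises + known, hypothesis)
        steps.append((premises, conclusion))
        known.append(conclusion)
        if conclusion == hypothesis:    # simplified stopping criterion
            break
    return steps

# Toy stubs so the skeleton runs end to end.
retrieve = lambda hyp, known: ["plants make food from sunlight"]
generate_step = lambda facts, hyp: hyp   # trivially "derives" the goal
print(build_entailment_tree("plants need sunlight", retrieve, generate_step))
```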

Robust Information Retrieval for False Claims with Distracting Entities In Fact Extraction and Verification

Dec 10, 2021
Mingwen Dong, Christos Christodoulopoulos, Sheng-Min Shih, Xiaofei Ma

Accurate evidence retrieval is essential for automated fact checking, yet little previous research has examined how differences between true and false claims affect evidence retrieval. This paper shows that, compared with true claims, false claims more frequently contain irrelevant entities, which can distract the evidence retrieval model. A BERT-based retrieval model made more mistakes in retrieving refuting evidence for false claims than supporting evidence for true claims. When tested with synthetically generated adversarial false claims containing irrelevant entities, the recall of the retrieval model is significantly lower than on the original claims. These results suggest that the vanilla BERT-based retrieval model is not robust to irrelevant entities in false claims. By augmenting the training data with synthetic false claims containing irrelevant entities, the trained model achieves higher evidence recall, including on false claims with irrelevant entities. In addition, using separate models to retrieve refuting and supporting evidence and then aggregating their results also increases evidence recall, including on false claims with irrelevant entities. These results suggest that we can improve the BERT-based retrieval model's robustness to false claims with irrelevant entities via data augmentation and model ensembling.
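
The two remedies described above can be sketched compactly: (1) synthesize false training claims by inserting an irrelevant entity, and (2) aggregate evidence from separate supporting- and refuting-evidence retrievers. The insertion template and the union-style aggregation are illustrative assumptions, not the paper's exact procedures.

```python
import random

def make_adversarial_claim(claim, distractor_entities):
    """Synthesize a false claim by attaching an irrelevant entity."""
    entity = random.choice(distractor_entities)
    return f"{claim}, according to {entity}"

def ensemble_retrieve(claim, support_retriever, refute_retriever, k=5):
    """Aggregate top-k evidence from two specialized retrievers, de-duplicated."""
    merged = support_retriever(claim, k) + refute_retriever(claim, k)
    seen, evidence = set(), []
    for doc in merged:                  # preserve rank order while merging
        if doc not in seen:
            seen.add(doc)
            evidence.append(doc)
    return evidence[:k]

# Toy usage with stub retrievers.
support = lambda c, k: ["doc_support_1", "doc_shared"]
refute = lambda c, k: ["doc_shared", "doc_refute_1"]
print(make_adversarial_claim("The Eiffel Tower is in Paris", ["Albert Einstein"]))
print(ensemble_retrieve("some claim", support, refute))
```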
