Jackie Chi Kit Cheung

School of Computer Science, McGill University, Mila

A Controlled Reevaluation of Coreference Resolution Models

Mar 31, 2024

Ian Porada, Xiyuan Zou, Jackie Chi Kit Cheung

Abstract: All state-of-the-art coreference resolution (CR) models involve finetuning a pretrained language model. Whether the superior performance of one CR model over another is due to the choice of language model or other factors, such as the task-specific architecture, is difficult or impossible to determine due to lack of a standardized experimental setup. To resolve this ambiguity, we systematically evaluate five CR models and control for certain design decisions including the pretrained language model used by each. When controlling for language model size, encoder-based CR models outperform more recent decoder-based models in terms of both accuracy and inference speed. Surprisingly, among encoder-based CR models, more recent models are not always more accurate, and the oldest CR model that we test generalizes the best to out-of-domain textual genres. We conclude that controlling for the choice of language model reduces most, but not all, of the increase in F1 score reported in the past five years.

* LREC-COLING 2024

Mechanisms of non-factual hallucinations in language models

Mar 27, 2024

Lei Yu, Meng Cao, Jackie Chi Kit Cheung, Yue Dong

Abstract: State-of-the-art language models (LMs) sometimes generate non-factual hallucinations that misalign with world knowledge. Despite extensive efforts to detect and mitigate hallucinations, understanding their internal mechanisms remains elusive. Our study investigates the mechanistic causes of hallucination, specifically non-factual ones where the LM incorrectly predicts object attributes in response to subject-relation queries. With causal mediation analysis and embedding space projection, we identify two general mechanistic causes of hallucinations shared across LMs of various scales and designs: 1) insufficient subject attribute knowledge in lower layer MLPs, and 2) failing to select the correct object attribute in upper layer attention heads and MLPs. These two mechanisms exhibit varying degrees of subject-object association, predictive uncertainty and perturbation robustness. Additionally, we scrutinize LM pre-training checkpoints, revealing distinct learning dynamics for the two mechanistic causes of hallucinations. We also highlight how attribution features from our causal analysis can effectively construct hallucination detectors. Our work proposes a mechanistic understanding of LM factual errors.

COSMIC: Mutual Information for Task-Agnostic Summarization Evaluation

Mar 01, 2024

Maxime Darrin, Philippe Formont, Jackie Chi Kit Cheung, Pablo Piantanida

Abstract: Assessing the quality of summarizers poses significant challenges. In response, we propose a novel task-oriented evaluation approach that assesses summarizers based on their capacity to produce summaries that are useful for downstream tasks, while preserving task outcomes. We theoretically establish a direct relationship between the resulting error probability of these tasks and the mutual information between source texts and generated summaries. We introduce COSMIC as a practical implementation of this metric, demonstrating its strong correlation with human judgment-based metrics and its effectiveness in predicting downstream task performance. Comparative analyses against established metrics like BERTScore and ROUGE highlight the competitive performance of COSMIC.

Analyzing Task-Encoding Tokens in Large Language Models

Jan 20, 2024

Yu Bai, Heyan Huang, Cesare Spinoso-Di Piano, Marc-Antoine Rondeau, Sanxing Chen, Yang Gao, Jackie Chi Kit Cheung

Abstract: In-context learning (ICL) has become an effective solution for few-shot learning in natural language processing. Past work has found that, during this process, representations of the last prompt token are utilized to store task reasoning procedures, thereby explaining the working mechanism of in-context learning. In this paper, we seek to locate and analyze other task-encoding tokens whose representations store task reasoning procedures. Supported by experiments that ablate the representations of different token types, we find that template and stopword tokens are the most prone to be task-encoding tokens. In addition, we demonstrate experimentally that lexical cues, repetition, and text formats are the main distinguishing characteristics of these tokens. Our work provides additional insights into how large language models (LLMs) leverage task reasoning procedures in ICL and suggests that future work may involve using task-encoding tokens to improve the computational efficiency of LLMs at inference time and their ability to handle long sequences.

* Work in progress

Responsible AI Considerations in Text Summarization Research: A Review of Current Practices

Nov 18, 2023

Yu Lu Liu, Meng Cao, Su Lin Blodgett, Jackie Chi Kit Cheung, Alexandra Olteanu, Adam Trischler

Abstract: AI and NLP publication venues have increasingly encouraged researchers to reflect on possible ethical considerations, adverse impacts, and other responsible AI issues their work might engender. However, for specific NLP tasks our understanding of how prevalent such issues are, or when and why these issues are likely to arise, remains limited. Focusing on text summarization -- a common NLP task largely overlooked by the responsible AI community -- we examine research and reporting practices in the current literature. We conduct a multi-round qualitative analysis of 333 summarization papers from the ACL Anthology published between 2020 and 2022. We focus on how, which, and when responsible AI issues are covered, which relevant stakeholders are considered, and mismatches between stated and realized research goals. We also discuss current evaluation practices and consider how authors discuss the limitations of both prior work and their own work. Overall, we find that relatively few papers engage with possible stakeholders or contexts of use, which limits their consideration of potential downstream adverse impacts or other responsible AI issues. Based on our findings, we make recommendations on concrete practices and research directions.

Successor Features for Efficient Multisubject Controlled Text Generation

Nov 03, 2023

Meng Cao, Mehdi Fatemi, Jackie Chi Kit Cheung, Samira Shabanian

Abstract: While large language models (LLMs) have achieved impressive performance in generating fluent and realistic text, controlling the generated text so that it exhibits properties such as safety, factuality, and non-toxicity remains challenging. Existing decoding-based methods are static in terms of the dimension of control; if the target subject is changed, they require new training. Moreover, it can quickly become prohibitive to concurrently control multiple subjects. In this work, we introduce SF-GEN, which is grounded in two primary concepts: successor features (SFs) to decouple the LLM's dynamics from task-specific rewards, and language model rectification to proportionally adjust the probability of selecting a token based on the likelihood that the finished text becomes undesired. SF-GEN seamlessly integrates the two to enable dynamic steering of text generation with no need to alter the LLM's parameters. Thanks to the decoupling effect induced by successor features, our method proves to be memory-wise and computationally efficient for training as well as decoding, especially when dealing with multiple target subjects. To the best of our knowledge, our research represents the first application of successor features in text generation. In addition to its computational efficiency, the resultant language produced by our method is comparable to the SOTA (and outperforms baselines) in both control measures as well as language quality, which we demonstrate through a series of experiments in various controllable text generation tasks.

Vārta: A Large-Scale Headline-Generation Dataset for Indic Languages

May 10, 2023

Rahul Aralikatte, Ziling Cheng, Sumanth Doddapaneni, Jackie Chi Kit Cheung

Abstract: We present Vārta, a large-scale multilingual dataset for headline generation in Indic languages. This dataset includes 41.8 million news articles in 14 different Indic languages (and English), which come from a variety of high-quality sources. To the best of our knowledge, this is the largest collection of curated articles for Indic languages currently available. We use the data collected in a series of experiments to answer important questions related to Indic NLP and multilinguality research in general. We show that the dataset is challenging even for state-of-the-art abstractive models and that they perform only slightly better than extractive baselines. Owing to its size, we also show that the dataset can be used to pretrain strong language models that outperform competitive baselines in both NLU and NLG benchmarks.

* Findings of ACL 2023

Investigating Failures to Generalize for Coreference Resolution Models

Mar 16, 2023

Ian Porada, Alexandra Olteanu, Kaheer Suleman, Adam Trischler, Jackie Chi Kit Cheung

Abstract: Coreference resolution models are often evaluated on multiple datasets. Datasets vary, however, in how coreference is realized -- i.e., how the theoretical concept of coreference is operationalized in the dataset -- due to factors such as the choice of corpora and annotation guidelines. We investigate the extent to which errors of current coreference resolution models are associated with existing differences in operationalization across datasets (OntoNotes, PreCo, and Winogrande). Specifically, we distinguish between and break down model performance into categories corresponding to several types of coreference, including coreferring generic mentions, compound modifiers, and copula predicates, among others. This breakdown helps us investigate how state-of-the-art models might vary in their ability to generalize across different coreference types. In our experiments, for example, models trained on OntoNotes perform poorly on generic mentions and copula predicates in PreCo. Our findings help calibrate expectations of current coreference resolution models, and future work can explicitly account for those types of coreference that are empirically associated with poor generalization when developing models.

Systematic Rectification of Language Models via Dead-end Analysis

Feb 27, 2023

Meng Cao, Mehdi Fatemi, Jackie Chi Kit Cheung, Samira Shabanian

Abstract: With adversarial or otherwise normal prompts, existing large language models (LLMs) can be pushed to generate toxic discourses. One way to reduce the risk of LLMs generating undesired discourses is to alter the training of the LLM. This can be very restrictive due to demanding computation requirements. Other methods rely on rule-based or prompt-based token elimination, which are limited as they dismiss future tokens and the overall meaning of the complete discourse. Here, we center detoxification on the probability that the finished discourse is ultimately considered toxic. That is, at each point, we advise against token selections proportional to how likely a finished text from this point will be toxic. To this end, we formally extend the dead-end theory from the recent reinforcement learning (RL) literature to also cover uncertain outcomes. Our approach, called rectification, utilizes a separate but significantly smaller model for detoxification, which can be applied to diverse LLMs as long as they share the same vocabulary. Importantly, our method does not require access to the internal representations of the LLM, but only the token probability distribution at each decoding step. This is crucial as many LLMs today are hosted in servers and only accessible through APIs. When applied to various LLMs, including GPT-3, our approach significantly improves the generated discourse compared to the base LLMs and other techniques in terms of both the overall language and detoxification performance.

* The Eleventh International Conference on Learning Representations (ICLR 2023)

Learning with Rejection for Abstractive Text Summarization

Feb 16, 2023

Meng Cao, Yue Dong, Jingyi He, Jackie Chi Kit Cheung

Abstract: State-of-the-art abstractive summarization systems frequently hallucinate content that is not supported by the source document, mainly due to noise in the training dataset. Existing methods opt to drop the noisy samples or tokens from the training set entirely, reducing the effective training set size and creating an artificial propensity to copy words from the source. In this work, we propose a training objective for abstractive summarization based on rejection learning, in which the model learns whether or not to reject potentially noisy tokens. We further propose a regularized decoding objective that penalizes non-factual candidate summaries during inference by using the rejection probability learned during training. We show that our method considerably improves the factuality of generated summaries in automatic and human evaluations when compared to five baseline models and that it does so while increasing the abstractiveness of the generated summaries.
