Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Kathleen McKeown

Columbia University

MASIVE: Open-Ended Affective State Identification in English and Spanish

Jul 16, 2024

Nicholas Deas, Elsbeth Turcan, Iván Pérez Mejía, Kathleen McKeown

Abstract:In the field of emotion analysis, much NLP research focuses on identifying a limited number of discrete emotion categories, often applied across languages. These basic sets, however, are rarely designed with textual data in mind, and culture, language, and dialect can influence how particular emotions are interpreted. In this work, we broaden our scope to a practically unbounded set of \textit{affective states}, which includes any terms that humans use to describe their experiences of feeling. We collect and publish MASIVE, a dataset of Reddit posts in English and Spanish containing over 1,000 unique affective states each. We then define the new problem of \textit{affective state identification} for language generation models framed as a masked span prediction task. On this task, we find that smaller finetuned multilingual models outperform much larger LLMs, even on region-specific Spanish affective states. Additionally, we show that pretraining on MASIVE improves model performance on existing emotion benchmarks. Finally, through machine translation experiments, we find that native speaker-written data is vital to good performance on this task.

Via

Access Paper or Ask Questions

STORYSUMM: Evaluating Faithfulness in Story Summarization

Jul 09, 2024

Melanie Subbiah, Faisal Ladhak, Akankshya Mishra, Griffin Adams, Lydia B. Chilton, Kathleen McKeown

Figure 1 for STORYSUMM: Evaluating Faithfulness in Story Summarization

Figure 2 for STORYSUMM: Evaluating Faithfulness in Story Summarization

Figure 3 for STORYSUMM: Evaluating Faithfulness in Story Summarization

Figure 4 for STORYSUMM: Evaluating Faithfulness in Story Summarization

Abstract:Human evaluation has been the gold standard for checking faithfulness in abstractive summarization. However, with a challenging source domain like narrative, multiple annotators can agree a summary is faithful, while missing details that are obvious errors only once pointed out. We therefore introduce a new dataset, STORYSUMM, comprising LLM summaries of short stories with localized faithfulness labels and error explanations. This benchmark is for evaluation methods, testing whether a given method can detect challenging inconsistencies. Using this dataset, we first show that any one human annotation protocol is likely to miss inconsistencies, and we advocate for pursuing a range of methods when establishing ground truth for a summarization dataset. We finally test recent automatic metrics and find that none of them achieve more than 70% balanced accuracy on this task, demonstrating that it is a challenging benchmark for future work in faithfulness evaluation.

Via

Access Paper or Ask Questions

Solving Zebra Puzzles Using Constraint-Guided Multi-Agent Systems

Jul 04, 2024

Shmuel Berman, Baishakhi Ray, Kathleen McKeown

Figure 1 for Solving Zebra Puzzles Using Constraint-Guided Multi-Agent Systems

Figure 2 for Solving Zebra Puzzles Using Constraint-Guided Multi-Agent Systems

Figure 3 for Solving Zebra Puzzles Using Constraint-Guided Multi-Agent Systems

Figure 4 for Solving Zebra Puzzles Using Constraint-Guided Multi-Agent Systems

Abstract:Prior research has enhanced the ability of Large Language Models (LLMs) to solve logic puzzles using techniques such as chain-of-thought prompting or introducing a symbolic representation. These frameworks are still usually insufficient to solve complicated logical problems, such as Zebra puzzles, due to the inherent complexity of translating natural language clues into logical statements. We introduce a multi-agent system, ZPS, that integrates LLMs with an off the shelf theorem prover. This system tackles the complex puzzle-solving task by breaking down the problem into smaller, manageable parts, generating SMT (Satisfiability Modulo Theories) code to solve them with a theorem prover, and using feedback between the agents to repeatedly improve their answers. We also introduce an automated grid puzzle grader to assess the correctness of our puzzle solutions and show that the automated grader is reliable by evaluating it in a user-study. Our approach shows improvement in all three LLMs we tested, with GPT-4 showing 166% improvement in the number of fully correct solutions.

Via

Access Paper or Ask Questions

TinyStyler: Efficient Few-Shot Text Style Transfer with Authorship Embeddings

Jun 21, 2024

Zachary Horvitz, Ajay Patel, Kanishk Singh, Chris Callison-Burch, Kathleen McKeown, Zhou Yu

Figure 1 for TinyStyler: Efficient Few-Shot Text Style Transfer with Authorship Embeddings

Figure 2 for TinyStyler: Efficient Few-Shot Text Style Transfer with Authorship Embeddings

Figure 3 for TinyStyler: Efficient Few-Shot Text Style Transfer with Authorship Embeddings

Figure 4 for TinyStyler: Efficient Few-Shot Text Style Transfer with Authorship Embeddings

Abstract:The goal of text style transfer is to transform the style of texts while preserving their original meaning, often with only a few examples of the target style. Existing style transfer methods generally rely on the few-shot capabilities of large language models or on complex controllable text generation approaches that are inefficient and underperform on fluency metrics. We introduce TinyStyler, a lightweight but effective approach, which leverages a small language model (800M params) and pre-trained authorship embeddings to perform efficient, few-shot text style transfer. We evaluate on the challenging task of authorship style transfer and find TinyStyler outperforms strong approaches such as GPT-4. We also evaluate TinyStyler's ability to perform text attribute style transfer (formal $\leftrightarrow$ informal) with automatic and human evaluations and find that the approach outperforms recent controllable text generation methods. Our model has been made publicly available at https://huggingface.co/tinystyler/tinystyler .

Via

Access Paper or Ask Questions

See It from My Perspective: Diagnosing the Western Cultural Bias of Large Vision-Language Models in Image Understanding

Jun 17, 2024

Amith Ananthram, Elias Stengel-Eskin, Carl Vondrick, Mohit Bansal, Kathleen McKeown

Figure 1 for See It from My Perspective: Diagnosing the Western Cultural Bias of Large Vision-Language Models in Image Understanding

Figure 2 for See It from My Perspective: Diagnosing the Western Cultural Bias of Large Vision-Language Models in Image Understanding

Figure 3 for See It from My Perspective: Diagnosing the Western Cultural Bias of Large Vision-Language Models in Image Understanding

Figure 4 for See It from My Perspective: Diagnosing the Western Cultural Bias of Large Vision-Language Models in Image Understanding

Abstract:Vision-language models (VLMs) can respond to queries about images in many languages. However, beyond language, culture affects how we see things. For example, individuals from Western cultures focus more on the central figure in an image while individuals from Eastern cultures attend more to scene context. In this work, we present a novel investigation that demonstrates and localizes VLMs' Western bias in image understanding. We evaluate large VLMs across subjective and objective visual tasks with culturally diverse images and annotations. We find that VLMs perform better on the Western subset than the Eastern subset of each task. Controlled experimentation tracing the source of this bias highlights the importance of a diverse language mix in text-only pre-training for building equitable VLMs, even when inference is performed in English. Moreover, while prompting in the language of a target culture can lead to reductions in bias, it is not a substitute for building AI more representative of the world's languages.

* 17 pages, 7 figures. Code/models: https://github.com/amith-ananthram/see-it-from-my-perspective

Via

Access Paper or Ask Questions

Reading Subtext: Evaluating Large Language Models on Short Story Summarization with Writers

Mar 02, 2024

Melanie Subbiah, Sean Zhang, Lydia B. Chilton, Kathleen McKeown

Abstract:We evaluate recent Large language Models (LLMs) on the challenging task of summarizing short stories, which can be lengthy, and include nuanced subtext or scrambled timelines. Importantly, we work directly with authors to ensure that the stories have not been shared online (and therefore are unseen by the models), and to obtain informed evaluations of summary quality using judgments from the authors themselves. Through quantitative and qualitative analysis grounded in narrative theory, we compare GPT-4, Claude-2.1, and LLama-2-70B. We find that all three models make faithfulness mistakes in over 50% of summaries and struggle to interpret difficult subtext. However, at their best, the models can provide thoughtful thematic analysis of stories. We additionally demonstrate that LLM judgments of summary quality do not match the feedback from the writers.

Via

Access Paper or Ask Questions

NewsQs: Multi-Source Question Generation for the Inquiring Mind

Feb 28, 2024

Alyssa Hwang, Kalpit Dixit, Miguel Ballesteros, Yassine Benajiba, Vittorio Castelli, Markus Dreyer, Mohit Bansal, Kathleen McKeown

Figure 1 for NewsQs: Multi-Source Question Generation for the Inquiring Mind

Figure 2 for NewsQs: Multi-Source Question Generation for the Inquiring Mind

Figure 3 for NewsQs: Multi-Source Question Generation for the Inquiring Mind

Figure 4 for NewsQs: Multi-Source Question Generation for the Inquiring Mind

Abstract:We present NewsQs (news-cues), a dataset that provides question-answer pairs for multiple news documents. To create NewsQs, we augment a traditional multi-document summarization dataset with questions automatically generated by a T5-Large model fine-tuned on FAQ-style news articles from the News On the Web corpus. We show that fine-tuning a model with control codes produces questions that are judged acceptable more often than the same model without them as measured through human evaluation. We use a QNLI model with high correlation with human annotations to filter our data. We release our final dataset of high-quality questions, answers, and document clusters as a resource for future work in query-based multi-document summarization.

* in submission

Via

Access Paper or Ask Questions

Social Orientation: A New Feature for Dialogue Analysis

Feb 26, 2024

Todd Morrill, Zhaoyuan Deng, Yanda Chen, Amith Ananthram, Colin Wayne Leach, Kathleen McKeown

Figure 1 for Social Orientation: A New Feature for Dialogue Analysis

Figure 2 for Social Orientation: A New Feature for Dialogue Analysis

Figure 3 for Social Orientation: A New Feature for Dialogue Analysis

Figure 4 for Social Orientation: A New Feature for Dialogue Analysis

Abstract:There are many settings where it is useful to predict and explain the success or failure of a dialogue. Circumplex theory from psychology models the social orientations (e.g., Warm-Agreeable, Arrogant-Calculating) of conversation participants and can be used to predict and explain the outcome of social interactions. Our work is novel in its systematic application of social orientation tags to modeling conversation outcomes. In this paper, we introduce a new data set of dialogue utterances machine-labeled with social orientation tags. We show that social orientation tags improve task performance, especially in low-resource settings, on both English and Chinese language benchmarks. We also demonstrate how social orientation tags help explain the outcomes of social interactions when used in neural models. Based on these results showing the utility of social orientation tags for dialogue outcome prediction tasks, we release our data sets, code, and models that are fine-tuned to predict social orientation tags on dialogue utterances.

* Accepted to LREC-COLING 2024

Via

Access Paper or Ask Questions

Getting Serious about Humor: Crafting Humor Datasets with Unfunny Large Language Models

Feb 23, 2024

Zachary Horvitz, Jingru Chen, Rahul Aditya, Harshvardhan Srivastava, Robert West, Zhou Yu, Kathleen McKeown

Abstract:Humor is a fundamental facet of human cognition and interaction. Yet, despite recent advances in natural language processing, humor detection remains a challenging task that is complicated by the scarcity of datasets that pair humorous texts with similar non-humorous counterparts. In our work, we investigate whether large language models (LLMs), can generate synthetic data for humor detection via editing texts. We benchmark LLMs on an existing human dataset and show that current LLMs display an impressive ability to `unfun' jokes, as judged by humans and as measured on the downstream task of humor detection. We extend our approach to a code-mixed English-Hindi humor dataset, where we find that GPT-4's synthetic data is highly rated by bilingual annotators and provides challenging adversarial examples for humor classifiers.

Via

Access Paper or Ask Questions

TofuEval: Evaluating Hallucinations of LLMs on Topic-Focused Dialogue Summarization

Feb 20, 2024

Liyan Tang, Igor Shalyminov, Amy Wing-mei Wong, Jon Burnsky, Jake W. Vincent, Yu'an Yang, Siffi Singh, Song Feng, Hwanjun Song, Hang Su(+4 more)

Figure 1 for TofuEval: Evaluating Hallucinations of LLMs on Topic-Focused Dialogue Summarization

Figure 2 for TofuEval: Evaluating Hallucinations of LLMs on Topic-Focused Dialogue Summarization

Figure 3 for TofuEval: Evaluating Hallucinations of LLMs on Topic-Focused Dialogue Summarization

Figure 4 for TofuEval: Evaluating Hallucinations of LLMs on Topic-Focused Dialogue Summarization

Abstract:Single document news summarization has seen substantial progress on faithfulness in recent years, driven by research on the evaluation of factual consistency, or hallucinations. We ask whether these advances carry over to other text summarization domains. We propose a new evaluation benchmark on topic-focused dialogue summarization, generated by LLMs of varying sizes. We provide binary sentence-level human annotations of the factual consistency of these summaries along with detailed explanations of factually inconsistent sentences. Our analysis shows that existing LLMs hallucinate significant amounts of factual errors in the dialogue domain, regardless of the model's size. On the other hand, when LLMs, including GPT-4, serve as binary factual evaluators, they perform poorly and can be outperformed by prevailing state-of-the-art specialized factuality evaluation metrics. Finally, we conducted an analysis of hallucination types with a curated error taxonomy. We find that there are diverse errors and error distributions in model-generated summaries and that non-LLM based metrics can capture all error types better than LLM-based evaluators.

* Linguistic annotations available at https://github.com/amazon-science/tofueval

Via

Access Paper or Ask Questions