Omri Abend

CovScore: Evaluation of Multi-Document Abstractive Title Set Generation

Jul 24, 2024

Itamar Trainin, Omri Abend

Abstract: This paper introduces CovScore, an automatic reference-less methodology for evaluating thematic title sets extracted from a corpus of documents. While such extraction methods are widely used, evaluating their effectiveness remains an open question. Moreover, some existing practices heavily rely on slow and laborious human annotation procedures. Inspired by recently introduced LLM-based judge methods, we propose a novel methodology that decomposes quality into five main metrics along different aspects of evaluation. This framing simplifies and expedites the manual evaluation process and enables automatic and independent LLM-based evaluation. As a test case, we apply our approach to a corpus of Holocaust survivor testimonies, motivated both by its relevance to title set extraction and by the moral significance of this pursuit. We validate the methodology by experimenting with naturalistic and synthetic title set generation systems and compare their performance with the methodology.

Learning from Naturally Occurring Feedback

Jul 15, 2024

Shachar Don-Yehiya, Leshem Choshen, Omri Abend

Abstract: Human feedback data is a critical component in developing language models. However, collecting this feedback is costly and ultimately not scalable. We propose a scalable method for extracting feedback that users naturally include when interacting with chat models, and leveraging it for model training. We are further motivated by previous work that showed there are also qualitative advantages to using naturalistic (rather than auto-generated) feedback, such as fewer hallucinations and biases. We manually annotated conversation data to confirm the presence of naturally occurring feedback in a standard corpus, finding that as much as 30% of the chats include explicit feedback. We apply our method to over 1M conversations to obtain hundreds of thousands of feedback samples. Training with the extracted feedback shows significant performance improvements over baseline models, demonstrating the efficacy of our approach in enhancing model alignment to human preferences.

A Nurse is Blue and Elephant is Rugby: Cross Domain Alignment in Large Language Models Reveal Human-like Patterns

May 23, 2024

Asaf Yehudai, Taelin Karidi, Gabriel Stanovsky, Ariel Goldstein, Omri Abend

Abstract: Cross-domain alignment refers to the task of mapping a concept from one domain to another. For example, "If a doctor were a color, what color would it be?". This seemingly peculiar task is designed to investigate how people represent concrete and abstract concepts through their mappings between categories and their reasoning processes over those mappings. In this paper, we adapt this task from cognitive science to evaluate the conceptualization and reasoning abilities of large language models (LLMs) through a behavioral study. We examine several LLMs by prompting them with a cross-domain mapping task and analyzing their responses at both the population and individual levels. Additionally, we assess the models' ability to reason about their predictions by analyzing and categorizing their explanations for these mappings. The results reveal several similarities between humans' and models' mappings and explanations, suggesting that models represent concepts similarly to humans. This similarity is evident not only in the model representation but also in their behavior. Furthermore, the models mostly provide valid explanations and deploy reasoning paths that are similar to those of humans.

* CogSci

Identifying Narrative Patterns and Outliers in Holocaust Testimonies Using Topic Modeling

May 04, 2024

Maxim Ifergan, Renana Keydar, Omri Abend, Amit Pinchevski

Abstract: The vast collection of Holocaust survivor testimonies presents invaluable historical insights but poses challenges for manual analysis. This paper leverages advanced Natural Language Processing (NLP) techniques to explore the USC Shoah Foundation Holocaust testimony corpus. By treating testimonies as structured question-and-answer sections, we apply topic modeling to identify key themes. We experiment with BERTopic, which leverages recent advances in language modeling technology. We align testimony sections into fixed parts, revealing the evolution of topics across the corpus of testimonies. This highlights both a common narrative schema and divergences between subgroups based on age and gender. We introduce a novel method to identify testimonies within groups that exhibit atypical topic distributions resembling those of other groups. This study offers unique insights into the complex narratives of Holocaust survivors, demonstrating the power of NLP to illuminate historical discourse and identify potential deviations in survivor experiences.

* 9 pages, 7 figures, LREC-COLING 2024

Jamba: A Hybrid Transformer-Mamba Language Model

Mar 28, 2024

Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz (+12 more)

Abstract: We present Jamba, a new base large language model based on a novel hybrid Transformer-Mamba mixture-of-experts (MoE) architecture. Specifically, Jamba interleaves blocks of Transformer and Mamba layers, enjoying the benefits of both model families. MoE is added in some of these layers to increase model capacity while keeping active parameter usage manageable. This flexible architecture allows resource- and objective-specific configurations. In the particular configuration we have implemented, we end up with a powerful model that fits in a single 80GB GPU. Built at large scale, Jamba provides high throughput and small memory footprint compared to vanilla Transformers, and at the same time state-of-the-art performance on standard language model benchmarks and long-context evaluations. Remarkably, the model presents strong results for up to 256K tokens context length. We study various architectural decisions, such as how to combine Transformer and Mamba layers, and how to mix experts, and show that some of them are crucial in large scale modeling. We also describe several interesting properties of these architectures which the training and evaluation of Jamba have revealed, and plan to release checkpoints from various ablation runs, to encourage further exploration of this novel architecture. We make the weights of our implementation of Jamba publicly available under a permissive license.

* Webpage: https://www.ai21.com/jamba

Human Learning by Model Feedback: The Dynamics of Iterative Prompting with Midjourney

Nov 20, 2023

Shachar Don-Yehiya, Leshem Choshen, Omri Abend

Abstract: Generating images with a Text-to-Image model often requires multiple trials, where human users iteratively update their prompt based on feedback, namely the output image. Taking inspiration from cognitive work on reference games and dialogue alignment, this paper analyzes the dynamics of the user prompts along such iterations. We compile a dataset of iterative interactions of human users with Midjourney. Our analysis then reveals that prompts predictably converge toward specific traits along these iterations. We further study whether this convergence is due to human users realizing they missed important details, or due to adaptation to the model's "preferences", producing better images for a specific language style. We show initial evidence that both possibilities are at play. The possibility that users adapt to the model's preference raises concerns about reusing user data for further training. The prompts may be biased towards the preferences of a specific model, rather than align with human intentions and natural manner of expression.

* EMNLP23

Improving Cross-Lingual Transfer through Subtree-Aware Word Reordering

Oct 20, 2023

Ofir Arviv, Dmitry Nikolaev, Taelin Karidi, Omri Abend

Abstract: Despite the impressive growth of the abilities of multilingual language models, such as XLM-R and mT5, it has been shown that they still face difficulties when tackling typologically distant languages, particularly in the low-resource setting. One obstacle for effective cross-lingual transfer is variability in word-order patterns. It can be potentially mitigated via source- or target-side word reordering, and numerous approaches to reordering have been proposed. However, they rely on language-specific rules, work on the level of POS tags, or only target the main clause, leaving subordinate clauses intact. To address these limitations, we present a new powerful reordering method, defined in terms of Universal Dependencies, that is able to learn fine-grained word-order patterns conditioned on the syntactic context from a small amount of annotated data and can be applied at all levels of the syntactic tree. We conduct experiments on a diverse set of tasks and show that our method consistently outperforms strong baselines over different language pairs and model architectures. This performance advantage holds true in both zero-shot and few-shot scenarios.

* Accepted to EMNLP Findings 2023

Generating Benchmarks for Factuality Evaluation of Language Models

Jul 13, 2023

Dor Muhlgay, Ori Ram, Inbal Magar, Yoav Levine, Nir Ratner, Yonatan Belinkov, Omri Abend, Kevin Leyton-Brown, Amnon Shashua, Yoav Shoham

Abstract: Before deploying a language model (LM) within a given domain, it is important to measure its tendency to generate factually incorrect information in that domain. Existing factual generation evaluation methods focus on facts sampled from the LM itself, and thus do not control the set of evaluated facts and might under-represent rare and unlikely facts. We propose FACTOR: Factual Assessment via Corpus TransfORmation, a scalable approach for evaluating LM factuality. FACTOR automatically transforms a factual corpus of interest into a benchmark evaluating an LM's propensity to generate true facts from the corpus vs. similar but incorrect statements. We use our framework to create two benchmarks: Wiki-FACTOR and News-FACTOR. We show that: (i) our benchmark scores increase with model size and improve when the LM is augmented with retrieval; (ii) benchmark score correlates with perplexity, but the two metrics do not always agree on model ranking; and (iii) when perplexity and benchmark score disagree, the latter better reflects factuality in open-ended generation, as measured by human annotators. We make our data and code publicly available at https://github.com/AI21Labs/factor.

MuLER: Detailed and Scalable Reference-based Evaluation

May 24, 2023

Taelin Karidi, Leshem Choshen, Gal Patel, Omri Abend

Abstract: We propose a novel methodology (namely, MuLER) that transforms any reference-based evaluation metric for text generation, such as machine translation (MT), into a fine-grained analysis tool. Given a system and a metric, MuLER quantifies how much the chosen metric penalizes specific error types (e.g., errors in translating names of locations). MuLER thus enables a detailed error analysis which can lead to targeted improvement efforts for specific phenomena. We perform experiments in both synthetic and naturalistic settings to support MuLER's validity and showcase its usability in MT evaluation and other tasks, such as summarization. Analyzing all submissions to WMT in 2014-2020, we find consistent trends. For example, nouns and verbs are among the most frequent POS tags, yet they are among the hardest to translate. Performance on most POS tags improves with overall system performance, but a few are not thus correlated (their identity changes from language to language). Preliminary experiments with summarization reveal similar trends.

Evaluating and Improving the Coreference Capabilities of Machine Translation Models

Feb 16, 2023

Asaf Yehudai, Arie Cattan, Omri Abend, Gabriel Stanovsky

Abstract: Machine translation (MT) requires a wide range of linguistic capabilities, which current end-to-end models are expected to learn implicitly by observing aligned sentences in bilingual corpora. In this work, we ask: How well do MT models learn coreference resolution from implicit signal? To answer this question, we develop an evaluation methodology that derives coreference clusters from MT output and evaluates them without requiring annotations in the target language. We further evaluate several prominent open-source and commercial MT systems, translating from English to six target languages, and compare them to state-of-the-art coreference resolvers on three challenging benchmarks. Our results show that the monolingual resolvers greatly outperform MT models. Motivated by this result, we experiment with different methods for incorporating the output of coreference resolution models in MT, showing improvement over strong baselines.

* EACL paper
