Gabriel Stanovsky

Beyond Benchmarks: On The False Promise of AI Regulation

Jan 26, 2025

Gabriel Stanovsky, Renana Keydar, Gadi Perl, Eliya Habba

Abstract: The rapid advancement of artificial intelligence (AI) systems in critical domains like healthcare, justice, and social services has sparked numerous regulatory initiatives aimed at ensuring their safe deployment. Current regulatory frameworks, exemplified by recent US and EU efforts, primarily focus on procedural guidelines while presuming that scientific benchmarking can effectively validate AI safety, similar to how crash tests verify vehicle safety or clinical trials validate drug efficacy. However, this approach fundamentally misunderstands the unique technical challenges posed by modern AI systems. Through systematic analysis of successful technology regulation case studies, we demonstrate that effective scientific regulation requires a causal theory linking observable test outcomes to future performance - for instance, how a vehicle's crash resistance at one speed predicts its safety at lower speeds. We show that deep learning models, which learn complex statistical patterns from training data without explicit causal mechanisms, preclude such guarantees. This limitation renders traditional regulatory approaches inadequate for ensuring AI safety. Moving forward, we call for regulators to reckon with this limitation, and propose a preliminary two-tiered regulatory framework that acknowledges these constraints: mandating human oversight for high-risk applications while developing appropriate risk communication strategies for lower-risk uses. Our findings highlight the urgent need to reconsider fundamental assumptions in AI regulation and suggest a concrete path forward for policymakers and researchers.

Improving Image Captioning by Mimicking Human Reformulation Feedback at Inference-time

Jan 08, 2025

Uri Berger, Omri Abend, Lea Frermann, Gabriel Stanovsky

Abstract: Incorporating automatically predicted human feedback into the process of training generative models has attracted substantial recent interest, while feedback at inference time has received less attention. The typical feedback at training time, i.e., preferences of choice given two samples, does not naturally transfer to the inference phase. We introduce a novel type of feedback -- caption reformulations -- and train models to mimic reformulation feedback based on human annotations. Our method does not require training the image captioning model itself, thereby demanding substantially less computational effort. We experiment with two types of reformulation feedback: first, we collect a dataset of human reformulations that correct errors in the generated captions. We find that incorporating reformulation models trained on this data into the inference phase of existing image captioning models results in improved captions, especially when the original captions are of low quality. We apply our method to non-English image captioning, a domain where robust models are less prevalent, and gain substantial improvement. Second, we apply reformulations to style transfer. Quantitative evaluations reveal state-of-the-art performance on German image captioning and English style transfer, while human validation with a detailed comparative framework exposes the specific axes of improvement.

The State and Fate of Summarization Datasets

Nov 07, 2024

Noam Dahan, Gabriel Stanovsky

Abstract: Automatic summarization has consistently attracted attention, due to its versatility and wide application in various downstream tasks. Despite its popularity, we find that annotation efforts have largely been disjointed, and have lacked common terminology. Consequently, it is challenging to discover existing resources or identify coherent research directions. To address this, we survey a large body of work spanning 133 datasets in over 100 languages, creating a novel ontology covering sample properties, collection methods and distribution. With this ontology we make key observations, including the lack of accessible high-quality datasets for low-resource languages, and the field's over-reliance on the news domain and on automatically collected distant supervision. Finally, we make available a web interface that allows users to interact with and explore our ontology and dataset collection, as well as a template for a summarization data card, which can be used to streamline future research into a more coherent body of work.

SAUCE: Synchronous and Asynchronous User-Customizable Environment for Multi-Agent LLM Interaction

Nov 05, 2024

Shlomo Neuberger, Niv Eckhaus, Uri Berger, Amir Taubenfeld, Gabriel Stanovsky, Ariel Goldstein

Abstract: Many human interactions, such as political debates, are carried out in group settings, where there are arbitrarily many participants, each with different views and agendas. To explore such complex social settings, we present SAUCE: a customizable Python platform, allowing researchers to plug-and-play various LLMs participating in discussions on any topic chosen by the user. Our platform takes care of instantiating the models, scheduling their responses, managing the discussion history, and producing a comprehensive output log, all customizable through configuration files, requiring little to no coding skills. A novel feature of SAUCE is asynchronous communication, where models decide when to speak in addition to what to say, thus modeling an important facet of human communication. We show SAUCE's attractiveness in two initial experiments, and invite the community to use it to simulate various group settings.

* https://github.com/Deep-Cognition-Lab/SAUCE

Looking Beyond The Top-1: Transformers Determine Top Tokens In Order

Oct 26, 2024

Daria Lioubashevski, Tomer Schlank, Gabriel Stanovsky, Ariel Goldstein

Abstract: Understanding the inner workings of Transformers is crucial for achieving more accurate and efficient predictions. In this work, we analyze the computation performed by Transformers in the layers after the top-1 prediction has become fixed, which has been previously referred to as the "saturation event". We expand the concept of saturation events to top-k tokens, demonstrating that similar saturation events occur across language, vision, and speech models. We find that these saturation events happen in order of the corresponding tokens' ranking, i.e., the model first decides on the top ranking token, then the second highest ranking token, and so on. This phenomenon seems intrinsic to the Transformer architecture, occurring across different architectural variants (decoder-only, encoder-only, and to a lesser extent full-Transformer), and even in untrained Transformers. We propose an underlying mechanism of task transition for this sequential saturation, where task k corresponds to predicting the k-th most probable token, and the saturation events are in fact discrete transitions between the tasks. In support of this we show that it is possible to predict the current task from hidden-layer embeddings. Furthermore, using an intervention method we demonstrate that we can cause the model to switch from one task to the next. Finally, leveraging our findings, we introduce a novel token-level early-exit strategy, which surpasses existing methods in balancing performance and efficiency.
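
To make the notion of a saturation event concrete, the following is a minimal sketch, not the authors' implementation. It assumes the per-layer token rankings have already been obtained (for example by projecting each layer's hidden state onto the vocabulary) and are passed in as plain lists; the function name `saturation_layers` and the input layout are hypothetical. For each rank, it reports the earliest layer from which the token at that rank stays equal to the final layer's token.

```python
from typing import List, Sequence


def saturation_layers(rankings: Sequence[Sequence[int]], k: int) -> List[int]:
    """Return, for each rank 0..k-1, the earliest layer after which the token
    at that rank no longer changes.

    rankings[l][r] is the token id ranked r-th (0 = top-1) at layer l.
    This input format is an assumption made for illustration.
    """
    final = rankings[-1]
    result = []
    for rank in range(k):
        layer = len(rankings) - 1
        # Walk backwards while the previous layer already agrees with the
        # final layer on the token at this rank.
        while layer > 0 and rankings[layer - 1][rank] == final[rank]:
            layer -= 1
        result.append(layer)
    return result


# Toy example with 5 layers and 3 ranked tokens per layer (made-up ids).
toy_rankings = [
    [5, 7, 9],
    [3, 7, 9],
    [3, 8, 9],
    [3, 8, 2],
    [3, 8, 2],
]
layers = saturation_layers(toy_rankings, k=3)
print(layers)
```

In this toy input the top-1 token settles first, then the second-ranked token, then the third, which is the ordering the abstract describes; real models would of course need to be measured rather than assumed to behave this way.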

Surveying the Landscape of Image Captioning Evaluation: A Comprehensive Taxonomy and Novel Ensemble Method

Aug 09, 2024

Uri Berger, Gabriel Stanovsky, Omri Abend, Lea Frermann

Abstract: The task of image captioning has recently been gaining popularity, and with it the complex task of evaluating the quality of image captioning models. In this work, we present the first survey and taxonomy of over 70 different image captioning metrics and their usage in hundreds of papers. We find that despite the diversity of proposed metrics, the vast majority of studies rely on only five popular metrics, which we show to be weakly correlated with human judgements. Instead, we propose EnsembEval -- an ensemble of evaluation methods achieving the highest reported correlation with human judgements across 5 image captioning datasets, showing there is a lot of room for improvement by leveraging a diverse set of metrics.

SEAM: A Stochastic Benchmark for Multi-Document Tasks

Jun 23, 2024

Gili Lior, Avi Caciularu, Arie Cattan, Shahar Levy, Ori Shapira, Gabriel Stanovsky

Abstract: Various tasks, such as summarization, multi-hop question answering, or coreference resolution, are naturally phrased over collections of real-world documents. Such tasks present a unique set of challenges, revolving around the lack of coherent narrative structure across documents, which often leads to contradiction, omission, or repetition of information. Despite their real-world application and challenging properties, there is currently no benchmark which specifically measures the abilities of large language models (LLMs) on multi-document tasks. To bridge this gap, we present SEAM (a Stochastic Evaluation Approach for Multi-document tasks), a conglomerate benchmark over a diverse set of multi-document datasets, setting conventional evaluation criteria, input-output formats, and evaluation protocols. In particular, SEAM addresses the sensitivity of LLMs to minor prompt variations through repeated evaluations, where in each evaluation we sample uniformly at random the values of arbitrary factors (e.g., the order of documents). We evaluate different LLMs on SEAM, finding that multi-document tasks pose a significant challenge for LLMs, even for state-of-the-art models with 70B parameters. In addition, we show that the stochastic approach uncovers underlying statistical trends which cannot be observed in a static benchmark. We hope that SEAM will spur progress via consistent and meaningful evaluation of multi-document tasks.
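
The repeated-evaluation idea can be illustrated with a short sketch. This is not SEAM's code: the function `stochastic_eval`, its parameters, and the toy scoring function are hypothetical, and only one arbitrary factor (document order) is randomized. Each run shuffles the documents uniformly at random, scores the result, and the scores are then summarized by their mean and spread.

```python
import random
import statistics
from typing import Callable, Dict, List


def stochastic_eval(
    documents: List[str],
    score_fn: Callable[[List[str]], float],
    n_runs: int = 5,
    seed: int = 0,
) -> Dict[str, object]:
    """Score the same instance several times under random document orders.

    score_fn stands in for "build a prompt from the ordered documents,
    query the model, and compute a metric"; it is an assumed placeholder.
    """
    rng = random.Random(seed)  # local generator keeps runs reproducible
    scores = []
    for _ in range(n_runs):
        order = documents[:]   # copy so the caller's list is untouched
        rng.shuffle(order)     # uniform random permutation of the documents
        scores.append(score_fn(order))
    return {
        "scores": scores,
        "mean": statistics.mean(scores),
        "stdev": statistics.pstdev(scores),
    }


# Toy scorer that is sensitive to which document comes first.
docs = ["doc_a", "doc_b", "doc_c"]
report = stochastic_eval(docs, lambda order: 1.0 if order[0] == "doc_a" else 0.0)
constant = stochastic_eval(docs, lambda order: 1.0)
print(report["mean"], report["stdev"])
```

Reporting a spread alongside the mean is what lets order sensitivity show up at all; a single fixed ordering would yield one number and hide it.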

In-Context Learning on a Budget: A Case Study in Named Entity Recognition

Jun 19, 2024

Uri Berger, Tal Baumel, Gabriel Stanovsky

Abstract: Few-shot in-context learning (ICL) typically assumes access to large annotated training sets. However, in many real-world scenarios, such as domain adaptation, there is only a limited budget to annotate a small number of samples, with the goal of maximizing downstream performance. We study various methods for selecting samples to annotate within a predefined budget, specifically focusing on the named entity recognition (NER) task, which has real-world applications, is expensive to annotate, and is relatively less studied in ICL setups. Across different models and datasets, we find that a relatively small pool of annotated samples can achieve results comparable to using the entire training set. Moreover, we discover that random selection of samples for annotation yields surprisingly good performance. Finally, we observe that a diverse annotation pool is correlated with improved performance. We hope that future work adopts our realistic paradigm which takes annotation budget into account.

Applying Intrinsic Debiasing on Downstream Tasks: Challenges and Considerations for Machine Translation

Jun 02, 2024

Bar Iluz, Yanai Elazar, Asaf Yehudai, Gabriel Stanovsky

Abstract: Most works on gender bias focus on intrinsic bias -- removing traces of information about a protected group from the model's internal representation. However, these works are often disconnected from the impact of such debiasing on downstream applications, which is the main motivation for debiasing in the first place. In this work, we systematically test how methods for intrinsic debiasing affect neural machine translation models, by measuring the extrinsic bias of such systems under different design choices. We highlight three challenges and mismatches between the debiasing techniques and their end-goal usage, including the choice of embeddings to debias, the mismatch between debiasing words and debiasing sub-word tokens, and the effect on different target languages. We find that these considerations have a significant impact on downstream performance and the success of debiasing.

A Nurse is Blue and Elephant is Rugby: Cross Domain Alignment in Large Language Models Reveal Human-like Patterns

May 23, 2024

Asaf Yehudai, Taelin Karidi, Gabriel Stanovsky, Ariel Goldstein, Omri Abend

Abstract: Cross-domain alignment refers to the task of mapping a concept from one domain to another. For example, "If a doctor were a color, what color would it be?". This seemingly peculiar task is designed to investigate how people represent concrete and abstract concepts through their mappings between categories and their reasoning processes over those mappings. In this paper, we adapt this task from cognitive science to evaluate the conceptualization and reasoning abilities of large language models (LLMs) through a behavioral study. We examine several LLMs by prompting them with a cross-domain mapping task and analyzing their responses at both the population and individual levels. Additionally, we assess the models' ability to reason about their predictions by analyzing and categorizing their explanations for these mappings. The results reveal several similarities between humans' and models' mappings and explanations, suggesting that models represent concepts similarly to humans. This similarity is evident not only in the model representation but also in their behavior. Furthermore, the models mostly provide valid explanations and deploy reasoning paths that are similar to those of humans.

* CogSci
