Mrinmaya Sachan

DIRAS: Efficient LLM-Assisted Annotation of Document Relevance in Retrieval Augmented Generation

Jun 20, 2024

Jingwei Ni, Tobias Schimanski, Meihong Lin, Mrinmaya Sachan, Elliott Ash, Markus Leippold

Abstract:Retrieval Augmented Generation (RAG) is widely employed to ground responses to queries on domain-specific documents. But do RAG implementations leave out important information or excessively include irrelevant information? To allay these concerns, it is necessary to annotate domain-specific benchmarks to evaluate information retrieval (IR) performance, as relevance definitions vary across queries and domains. Furthermore, such benchmarks should be cost-efficiently annotated to avoid annotation selection bias. In this paper, we propose DIRAS (Domain-specific Information Retrieval Annotation with Scalability), a manual-annotation-free schema that fine-tunes open-sourced LLMs to annotate relevance labels with calibrated relevance probabilities. Extensive evaluation shows that DIRAS fine-tuned models achieve GPT-4-level performance on annotating and ranking unseen (query, document) pairs, and are helpful for real-world RAG development.

AI-Assisted Human Evaluation of Machine Translation

Jun 18, 2024

Vilém Zouhar, Tom Kocmi, Mrinmaya Sachan

Abstract:Annually, research teams spend large amounts of money to evaluate the quality of machine translation systems (WMT, inter alia). This is expensive because it requires detailed human labor. The recently proposed annotation protocol, Error Span Annotation (ESA), has annotators mark erroneous parts of the translation. In our work, we help the annotators by pre-filling the span annotations with automatic quality estimation. With AI assistance, we obtain more detailed annotations while cutting down the time per span annotation by half (71s/error span $\rightarrow$ 31s/error span). The biggest advantage of the ESA$^\mathrm{AI}$ protocol is an accurate priming of annotators (pre-filled error spans) before they assign the final score, as opposed to starting from scratch. In addition, the annotation budget can be reduced by up to 24% by filtering out examples that the AI deems very likely to be correct.

Error Span Annotation: A Balanced Approach for Human Evaluation of Machine Translation

Jun 17, 2024

Tom Kocmi, Vilém Zouhar, Eleftherios Avramidis, Roman Grundkiewicz, Marzena Karpinska, Maja Popović, Mrinmaya Sachan, Mariya Shmatova

Abstract:High-quality Machine Translation (MT) evaluation relies heavily on human judgments. Comprehensive error classification methods, such as Multidimensional Quality Metrics (MQM), are expensive as they are time-consuming and can only be done by experts, whose availability may be limited, especially for low-resource languages. On the other hand, just assigning overall scores, like Direct Assessment (DA), is simpler and faster and can be done by translators of any level, but is less reliable. In this paper, we introduce Error Span Annotation (ESA), a human evaluation protocol which combines the continuous rating of DA with the high-level error severity span marking of MQM. We validate ESA by comparing it to MQM and DA for 12 MT systems and one human reference translation (English to German) from WMT23. The results show that ESA offers faster and cheaper annotations than MQM at the same quality level, without the requirement of expensive MQM experts.

What Do Language Models Learn in Context? The Structured Task Hypothesis

Jun 06, 2024

Jiaoda Li, Yifan Hou, Mrinmaya Sachan, Ryan Cotterell

Abstract:Large language models (LLMs) exhibit an intriguing ability to learn a novel task from in-context examples presented in a demonstration, termed in-context learning (ICL). Understandably, a swath of research has been dedicated to uncovering the theories underpinning ICL. One popular hypothesis explains ICL by task selection. LLMs identify the task based on the demonstration and generalize it to the prompt. Another popular hypothesis is that ICL is a form of meta-learning, i.e., the models learn a learning algorithm at pre-training time and apply it to the demonstration. Finally, a third hypothesis argues that LLMs use the demonstration to select a composition of tasks learned during pre-training to perform ICL. In this paper, we empirically explore these three hypotheses that explain LLMs' ability to learn in context with a suite of experiments derived from common text classification tasks. We invalidate the first two hypotheses with counterexamples and provide evidence in support of the last hypothesis. Our results suggest an LLM could learn a novel task in context via composing tasks learned during pre-training.

* This work is published in ACL 2024

On Affine Homotopy between Language Encoders

Jun 04, 2024

Robin SM Chan, Reda Boumasmoud, Anej Svete, Yuxin Ren, Qipeng Guo, Zhijing Jin, Shauli Ravfogel, Mrinmaya Sachan, Bernhard Schölkopf, Mennatallah El-Assady (+1 more)

Abstract:Pre-trained language encoders -- functions that represent text as vectors -- are an integral component of many NLP tasks. We tackle a natural question in language encoder analysis: What does it mean for two encoders to be similar? We contend that a faithful measure of similarity needs to be intrinsic, that is, task-independent, yet still be informative of extrinsic similarity -- the performance on downstream tasks. It is common to consider two encoders similar if they are homotopic, i.e., if they can be aligned through some transformation. In this spirit, we study the properties of affine alignment of language encoders and its implications on extrinsic similarity. We find that while affine alignment is fundamentally an asymmetric notion of similarity, it is still informative of extrinsic similarity. We confirm this on datasets of natural language representations. Beyond providing useful bounds on extrinsic similarity, affine intrinsic similarity also allows us to begin uncovering the structure of the space of pre-trained encoders by defining an order over them.

* 10 pages

CausalQuest: Collecting Natural Causal Questions for AI Agents

May 30, 2024

Roberto Ceraolo, Dmitrii Kharlapenko, Amélie Reymond, Rada Mihalcea, Mrinmaya Sachan, Bernhard Schölkopf, Zhijing Jin

Abstract:Humans have an innate drive to seek out causality. Whether fuelled by curiosity or specific goals, we constantly question why things happen, how they are interconnected, and many other related phenomena. To develop AI agents capable of addressing this natural human quest for causality, we urgently need a comprehensive dataset of natural causal questions. Unfortunately, existing datasets either contain only artificially-crafted questions that do not reflect real AI usage scenarios or have limited coverage of questions from specific sources. To address this gap, we present CausalQuest, a dataset of 13,500 naturally occurring questions sourced from social networks, search engines, and AI assistants. We formalize the definition of causal questions and establish a taxonomy for finer-grained classification. Through a combined effort of human annotators and large language models (LLMs), we carefully label the dataset. We find that 42% of the questions humans ask are indeed causal, with the majority seeking to understand the causes behind given effects. Using this dataset, we train efficient classifiers (up to 2.85B parameters) for the binary task of identifying causal questions, achieving high performance with F1 scores of up to 0.877. We conclude with a rich set of future research directions that can build upon our data and models.

Implicit Personalization in Language Models: A Systematic Study

May 23, 2024

Zhijing Jin, Nils Heil, Jiarui Liu, Shehzaad Dhuliawala, Yahang Qi, Bernhard Schölkopf, Rada Mihalcea, Mrinmaya Sachan

Abstract:Implicit Personalization (IP) is a phenomenon of language models inferring a user's background from the implicit cues in the input prompts and tailoring the response based on this inference. While previous work has touched upon various instances of this problem, a unified framework to study this behavior is lacking. This work systematically studies IP through a rigorous mathematical formulation, a multi-perspective moral reasoning framework, and a set of case studies. Our theoretical foundation for IP relies on a structural causal model and introduces a novel method, indirect intervention, to estimate the causal effect of a mediator variable that cannot be directly intervened upon. Beyond the technical approach, we also introduce a set of moral reasoning principles based on three schools of moral philosophy to study when IP may or may not be ethically appropriate. Equipped with both mathematical and ethical insights, we present three diverse case studies illustrating the varied nature of the IP problem and offer recommendations for future research. Our code and data are at https://github.com/jiarui-liu/IP.

A Transformer with Stack Attention

May 07, 2024

Jiaoda Li, Jennifer C. White, Mrinmaya Sachan, Ryan Cotterell

Abstract:Natural languages are believed to be (mildly) context-sensitive. Despite underpinning remarkably capable large language models, transformers are unable to model many context-free language tasks. In an attempt to address this limitation in the modeling power of transformer-based language models, we propose augmenting them with a differentiable, stack-based attention mechanism. Our stack-based attention mechanism can be incorporated into any transformer-based language model and adds a level of interpretability to the model. We show that the addition of our stack-based attention mechanism enables the transformer to model some, but not all, deterministic context-free languages.

* NAACL 2024

Cooperate or Collapse: Emergence of Sustainability Behaviors in a Society of LLM Agents

Apr 25, 2024

Giorgio Piatti, Zhijing Jin, Max Kleiman-Weiner, Bernhard Schölkopf, Mrinmaya Sachan, Rada Mihalcea

Abstract:In the rapidly evolving field of artificial intelligence, ensuring safe decision-making of Large Language Models (LLMs) is a significant challenge. This paper introduces Governance of the Commons Simulation (GovSim), a simulation platform designed to study strategic interactions and cooperative decision-making in LLMs. Through this simulation environment, we explore the dynamics of resource sharing among AI agents, highlighting the importance of ethical considerations, strategic planning, and negotiation skills. GovSim is versatile and supports any text-based agent, including LLM agents. Using the Generative Agent framework, we create a standard agent that facilitates the integration of different LLMs. Our findings reveal that within GovSim, only two out of 15 tested LLMs managed to achieve a sustainable outcome, indicating a significant gap in the ability of models to manage shared resources. Furthermore, we find that when the ability of agents to communicate is removed, they overuse the shared resource, highlighting the importance of communication for cooperation. Interestingly, most LLMs lack the ability to make universalized hypotheses, which highlights a significant weakness in their reasoning skills. We open source the full suite of our research results, including the simulation environment, agent prompts, and a comprehensive web interface.

On the Causal Nature of Sentiment Analysis

Apr 17, 2024

Zhiheng Lyu, Zhijing Jin, Fernando Gonzalez, Rada Mihalcea, Bernhard Schoelkopf, Mrinmaya Sachan

Abstract:Sentiment analysis (SA) aims to identify the sentiment expressed in a text, such as a product review. Given a review and the sentiment associated with it, this paper formulates SA as a combination of two tasks: (1) a causal discovery task that distinguishes whether a review "primes" the sentiment (Causal Hypothesis C1), or the sentiment "primes" the review (Causal Hypothesis C2); and (2) the traditional prediction task to model the sentiment using the review as input. Using the peak-end rule in psychology, we classify a sample as C1 if its overall sentiment score approximates an average of all the sentence-level sentiments in the review, and C2 if the overall sentiment score approximates an average of the peak and end sentiments. For the prediction task, we use the discovered causal mechanisms behind the samples to improve the performance of LLMs by proposing causal prompts that give the models an inductive bias of the underlying causal graph, leading to substantial improvements by up to 32.13 F1 points on zero-shot five-class SA. Our code is at https://github.com/cogito233/causal-sa

* An enhanced version of our previous exploration in arXiv:2305.01764
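
The C1/C2 rule described in the abstract above can be illustrated with a minimal Python sketch. This is not the authors' implementation (their code is at the linked repository). The function name `classify_causal_direction`, the `neutral` midpoint parameter, the definition of "peak" as the sentence score farthest from that midpoint, and the tie-breaking toward C1 are all assumptions made for illustration.

```python
from statistics import mean


def classify_causal_direction(sentence_scores, overall_score, neutral=3.0):
    """Label a review as 'C1' or 'C2' using a peak-end style comparison.

    Assumptions (not taken from the paper): scores share one numeric scale,
    `neutral` is the scale midpoint, the "peak" is the sentence score with
    the largest absolute deviation from `neutral`, and ties go to 'C1'.
    """
    if not sentence_scores:
        raise ValueError("sentence_scores must not be empty")

    # C1 reference: the plain average of all sentence-level sentiments.
    average_all = mean(sentence_scores)

    # C2 reference: the average of the peak and the final sentence sentiment.
    peak = max(sentence_scores, key=lambda s: abs(s - neutral))
    end = sentence_scores[-1]
    peak_end = (peak + end) / 2

    # Pick whichever reference value the overall score approximates better.
    if abs(overall_score - average_all) <= abs(overall_score - peak_end):
        return "C1"
    return "C2"


if __name__ == "__main__":
    scores = [4, 1, 4, 4]  # hypothetical sentence-level scores on a 1-5 scale
    print(classify_causal_direction(scores, overall_score=3.2))
    print(classify_causal_direction(scores, overall_score=2.5))
```

In this hypothetical example the sentence average is 3.25 and the peak-end average is 2.5, so an overall score of 3.2 falls under C1 and an overall score of 2.5 falls under C2.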
