Andreas Vlachos

Decompose and Leverage Preferences from Expert Models for Improving Trustworthiness of MLLMs

Nov 20, 2024

Rui Cao, Yuming Jiang, Michael Schlichtkrull, Andreas Vlachos

Abstract: Multimodal Large Language Models (MLLMs) can enhance trustworthiness by aligning with human preferences. As human preference labeling is laborious, recent works employ evaluation models for assessing MLLMs' responses, using the model-based assessments to automate preference dataset construction. This approach, however, faces challenges with MLLMs' lengthy and compositional responses, which often require diverse reasoning skills that a single evaluation model may not fully possess. Additionally, most existing methods rely on closed-source models as evaluators. To address these limitations, we propose DecompGen, a decomposable framework that uses an ensemble of open-source expert models. DecompGen breaks down each response into atomic verification tasks, assigning each task to an appropriate expert model to generate fine-grained assessments. The DecompGen feedback is used to automatically construct our preference dataset, DGPref. MLLMs aligned with DGPref via preference learning show improvements in trustworthiness, demonstrating the effectiveness of DecompGen.

A Bayesian Optimization Approach to Machine Translation Reranking

Nov 14, 2024

Julius Cheng, Maike Züfle, Vilém Zouhar, Andreas Vlachos

Abstract: Reranking a list of candidates from a machine translation system with an external scoring model and returning the highest-scoring candidate remains a simple and effective method for improving the overall output quality. Translation scoring models continue to grow in size, with the best models being comparable to generation models. Thus, reranking can add substantial computational cost to the translation pipeline. In this work, we pose reranking as a Bayesian optimization (BayesOpt) problem. By strategically selecting candidates to score based on a balance of exploration and exploitation, we show that it is possible to find top-scoring candidates when scoring only a fraction of the candidate list. For instance, our method achieves the same CometKiwi score using only 70 scoring evaluations compared to a baseline system using 180. We present a multi-fidelity setting for BayesOpt, where the candidates are first scored with a cheaper but noisier proxy scoring model, which further improves the cost-performance tradeoff when using smaller but well-trained distilled proxy scorers.

* v1: Preprint version

Ev2R: Evaluating Evidence Retrieval in Automated Fact-Checking

Nov 08, 2024

Mubashara Akhtar, Michael Schlichtkrull, Andreas Vlachos

Abstract: Current automated fact-checking (AFC) approaches commonly evaluate evidence either implicitly via the predicted verdicts or by comparing retrieved evidence with a predefined closed knowledge source, such as Wikipedia. However, these methods suffer from limitations, resulting from their reliance on evaluation metrics developed for different purposes and constraints imposed by closed knowledge sources. Recent advances in natural language generation (NLG) evaluation offer new possibilities for evidence assessment. In this work, we introduce Ev2R, an evaluation framework for AFC that comprises three types of approaches for evidence evaluation: reference-based, proxy-reference, and reference-less. We evaluate their effectiveness through agreement with human ratings and adversarial tests, and demonstrate that prompt-based scorers, particularly those leveraging LLMs and reference evidence, outperform traditional evaluation approaches.

* 10 pages

TabVer: Tabular Fact Verification with Natural Logic

Nov 02, 2024

Rami Aly, Andreas Vlachos

Abstract: Fact verification on tabular evidence incentivises the use of symbolic reasoning models where a logical form is constructed (e.g. a LISP-style program), providing greater verifiability than fully neural approaches. However, these systems typically rely on well-formed tables, restricting their use in many scenarios. An emerging symbolic reasoning paradigm for textual evidence focuses on natural logic inference, which constructs proofs by modelling set-theoretic relations between a claim and its evidence in natural language. This approach provides flexibility and transparency but is less compatible with tabular evidence since the relations do not extend to arithmetic functions. We propose a set-theoretic interpretation of numerals and arithmetic functions in the context of natural logic, enabling the integration of arithmetic expressions in deterministic proofs. We leverage large language models to generate arithmetic expressions by generating questions about salient parts of a claim which are answered by executing appropriate functions on tables. In a few-shot setting on FEVEROUS, we achieve an accuracy of 71.4, outperforming both fully neural and symbolic reasoning models by 3.4 points. When evaluated on TabFact without any further training, our method remains competitive with an accuracy lead of 0.5 points.

* Accepted to TACL. This is a slightly extended version

The Automated Verification of Textual Claims (AVeriTeC) Shared Task

Oct 31, 2024

Michael Schlichtkrull, Yulong Chen, Chenxi Whitehouse, Zhenyun Deng, Mubashara Akhtar, Rami Aly, Zhijiang Guo, Christos Christodoulopoulos, Oana Cocarascu, Arpit Mittal (+2 more)

Abstract: The Automated Verification of Textual Claims (AVeriTeC) shared task asks participants to retrieve evidence and predict veracity for real-world claims checked by fact-checkers. Evidence can be found either via a search engine, or via a knowledge store provided by the organisers. Submissions are evaluated using the AVeriTeC score, which considers a claim to be accurately verified if and only if both the verdict is correct and the retrieved evidence is considered to meet a certain quality threshold. The shared task received 21 submissions, 18 of which surpassed our baseline. The winning team was TUDA_MAI with an AVeriTeC score of 63%. In this paper we describe the shared task, present the full results, and highlight key takeaways from the shared task.
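
The scoring rule described in the abstract (a claim counts only if the verdict is correct and the evidence meets a quality threshold) can be sketched as follows. This is a minimal illustration, not the organisers' scorer: the `Prediction` class, the `evidence_quality` field, and the threshold value used in the example are hypothetical, and the abstract does not say how evidence quality is measured.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    # All field names are hypothetical; they are not taken from the shared task.
    predicted_verdict: str
    gold_verdict: str
    evidence_quality: float  # assumed: some evidence-quality score, higher is better


def averitec_style_score(predictions, quality_threshold):
    """Fraction of claims whose verdict is correct AND whose evidence
    quality meets the threshold (both conditions must hold)."""
    if not predictions:
        return 0.0
    verified = sum(
        1
        for p in predictions
        if p.evidence_quality >= quality_threshold
        and p.predicted_verdict == p.gold_verdict
    )
    return verified / len(predictions)


example = [
    Prediction("Supported", "Supported", 0.9),  # counts: correct, good evidence
    Prediction("Refuted", "Refuted", 0.1),      # correct verdict, weak evidence
    Prediction("Supported", "Refuted", 0.9),    # good evidence, wrong verdict
    Prediction("Refuted", "Refuted", 0.5),      # counts: meets the threshold exactly
]
example_score = averitec_style_score(example, quality_threshold=0.5)
```

Under this rule a system cannot gain credit from a correct verdict alone, which is why the second example claim does not count.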

Conformity in Large Language Models

Oct 16, 2024

Xiaochen Zhu, Caiqi Zhang, Tom Stafford, Nigel Collier, Andreas Vlachos

Abstract: The conformity effect describes the tendency of individuals to align their responses with the majority. Studying this bias in large language models (LLMs) is crucial, as LLMs are increasingly used in various information-seeking and decision-making tasks as conversation partners to improve productivity. Thus, conformity to incorrect responses can compromise their effectiveness. In this paper, we adapt psychological experiments to examine the extent of conformity in state-of-the-art LLMs. Our findings reveal that all models tested exhibit varying levels of conformity toward the majority, regardless of their initial choice or correctness, across different knowledge domains. Notably, we are the first to show that LLMs are more likely to conform when they are more uncertain in their own prediction. We further explore factors that influence conformity, such as training paradigms and input characteristics, finding that instruction-tuned models are less susceptible to conformity, while increasing the naturalness of majority tones amplifies conformity. Finally, we propose two interventions, Devil's Advocate and Question Distillation, to mitigate conformity, providing insights into building more robust language models.

* 16 pages (8 pages main body), 14 figures

ALVIN: Active Learning Via INterpolation

Oct 11, 2024

Michalis Korakakis, Andreas Vlachos, Adrian Weller

Abstract: Active Learning aims to minimize annotation effort by selecting the most useful instances from a pool of unlabeled data. However, typical active learning methods overlook the presence of distinct example groups within a class, whose prevalence may vary, e.g., in occupation classification datasets certain demographics are disproportionately represented in specific classes. This oversight causes models to rely on shortcuts for predictions, i.e., spurious correlations between input attributes and labels occurring in well-represented groups. To address this issue, we propose Active Learning Via INterpolation (ALVIN), which conducts intra-class interpolations between examples from under-represented and well-represented groups to create anchors, i.e., artificial points situated between the example groups in the representation space. By selecting instances close to the anchors for annotation, ALVIN identifies informative examples exposing the model to regions of the representation space that counteract the influence of shortcuts. Crucially, since the model considers these examples to be of high certainty, they are likely to be ignored by typical active learning methods. Experimental results on six datasets encompassing sentiment analysis, natural language inference, and paraphrase detection demonstrate that ALVIN outperforms state-of-the-art active learning methods in both in-distribution and out-of-distribution generalization.

* Accepted to EMNLP 2024 (Main)
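
The anchor idea in the abstract above (interpolate between under-represented and well-represented examples of one class, then pick unlabeled instances near the resulting points) can be sketched roughly as below. This is an illustrative sketch under several assumptions that the abstract does not confirm: representations are plain lists of floats, the two groups are already identified, the mixing weight is drawn uniformly, and closeness is Euclidean distance. All function names are hypothetical.

```python
import math
import random


def interpolate(a, b, lam):
    """Convex combination lam * a + (1 - lam) * b of two representation vectors."""
    return [lam * x + (1 - lam) * y for x, y in zip(a, b)]


def make_anchors(under_represented, well_represented, num_anchors, rng):
    """Create anchors between randomly paired examples of the same class.
    Assumption: the mixing weight is uniform in [0, 1)."""
    anchors = []
    for _ in range(num_anchors):
        u = rng.choice(under_represented)
        w = rng.choice(well_represented)
        anchors.append(interpolate(u, w, rng.random()))
    return anchors


def select_near_anchors(pool, anchors, k):
    """Return indices of the k pool instances closest to any anchor."""
    def distance_to_nearest_anchor(x):
        return min(math.dist(x, a) for a in anchors)

    ranked = sorted(range(len(pool)), key=lambda i: distance_to_nearest_anchor(pool[i]))
    return ranked[:k]


rng = random.Random(0)
under = [[0.0, 0.0]]   # toy under-represented example of one class
well = [[10.0, 0.0]]   # toy well-represented example of the same class
anchors = make_anchors(under, well, num_anchors=5, rng=rng)
pool = [[5.0, 0.1], [5.0, 50.0], [-40.0, 0.0]]  # toy unlabeled pool
chosen = select_near_anchors(pool, anchors, k=1)
```

In the toy data every anchor lies on the segment between the two group examples, so the pool instance nearest that segment is the one selected for annotation.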

Zero-Shot Fact Verification via Natural Logic and Large Language Models

Oct 04, 2024

Marek Strong, Rami Aly, Andreas Vlachos

Abstract: The recent development of fact verification systems with natural logic has enhanced their explainability by aligning claims with evidence through set-theoretic operators, providing faithful justifications. Despite these advancements, such systems often rely on a large amount of training data annotated with natural logic. To address this issue, we propose a zero-shot method that utilizes the generalization capabilities of instruction-tuned large language models. To comprehensively assess the zero-shot capabilities of our method and other fact verification systems, we evaluate all models on both artificial and real-world claims, including multilingual datasets. We also compare our method against other fact verification systems in two setups. First, in the zero-shot generalization setup, we demonstrate that our approach outperforms other systems that were not specifically trained on natural logic data, achieving an average accuracy improvement of 8.96 points over the best-performing baseline. Second, in the zero-shot transfer setup, we show that current systems trained on natural logic data do not generalize well to other domains, and our method outperforms these systems across all datasets with real-world claims.

* Accepted to EMNLP 2024

An LLM Feature-based Framework for Dialogue Constructiveness Assessment

Jun 20, 2024

Lexin Zhou, Youmna Farag, Andreas Vlachos

Abstract: Research on dialogue constructiveness assessment focuses on (i) analysing conversational factors that influence individuals to take specific actions, win debates, change their perspectives or broaden their open-mindedness and (ii) predicting constructive outcomes following dialogues for such use cases. These objectives can be achieved by training either interpretable feature-based models (which often involve costly human annotations) or neural models such as pre-trained language models (which have empirically shown higher task accuracy but lack interpretability). We propose a novel LLM feature-based framework that combines the strengths of feature-based and neural approaches while mitigating their downsides, in assessing dialogue constructiveness. The framework first defines a set of dataset-independent and interpretable linguistic features, which can be extracted by both prompting an LLM and simple heuristics. Such features are then used to train LLM feature-based models. We apply this framework to three datasets of dialogue constructiveness and find that our LLM feature-based models significantly outperform standard feature-based models and neural models, and tend to learn more robust prediction rules instead of relying on superficial shortcuts (as seen with neural models). Further, we demonstrate that interpreting these LLM feature-based models can yield valuable insights into what makes a dialogue constructive.

Causal Estimation of Memorisation Profiles

Jun 06, 2024

Pietro Lesci, Clara Meister, Thomas Hofmann, Andreas Vlachos, Tiago Pimentel

Abstract: Understanding memorisation in language models has practical and societal implications, e.g., studying models' training dynamics or preventing copyright infringements. Prior work defines memorisation as the causal effect of training with an instance on the model's ability to predict that instance. This definition relies on a counterfactual: the ability to observe what would have happened had the model not seen that instance. Existing methods struggle to provide computationally efficient and accurate estimates of this counterfactual. Further, they often estimate memorisation for a model architecture rather than for a specific model instance. This paper fills an important gap in the literature, proposing a new, principled, and efficient method to estimate memorisation based on the difference-in-differences design from econometrics. Using this method, we characterise a model's memorisation profile (its memorisation trends across training) by only observing its behaviour on a small set of instances throughout training. In experiments with the Pythia model suite, we find that memorisation (i) is stronger and more persistent in larger models, (ii) is determined by data order and learning rate, and (iii) has stable trends across model sizes, thus making memorisation in larger models predictable from smaller ones.

* Published at the ACL 2024 Conference (main)
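
For readers unfamiliar with the difference-in-differences design mentioned in the abstract above, the textbook two-group, two-period estimator is sketched below. This is the generic econometric estimator, not the paper's estimator. Mapping "treated" to instances the model trained on between two checkpoints and "control" to instances it has not yet seen is an assumption made here for illustration, as are the toy log-likelihood values.

```python
from statistics import mean


def diff_in_diff(treated_before, treated_after, control_before, control_after):
    """Classic 2x2 difference-in-differences estimate:
    (change in the treated group) minus (change in the control group)."""
    treated_change = mean(treated_after) - mean(treated_before)
    control_change = mean(control_after) - mean(control_before)
    return treated_change - control_change


# Toy per-instance log-likelihoods at two checkpoints (invented values).
seen_before, seen_after = [-4.0, -5.0], [-1.0, -2.0]      # assumed "treated" instances
unseen_before, unseen_after = [-4.0, -4.0], [-3.0, -3.0]  # assumed "control" instances

effect = diff_in_diff(seen_before, seen_after, unseen_before, unseen_after)
```

Subtracting the control group's change removes the improvement that both groups share, leaving the part attributable to having trained on the instances, provided the two groups would otherwise have followed parallel trends.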
