Alert button
Picture for Michael Witbrock

Michael Witbrock

Alert button

Do Smaller Language Models Answer Contextualised Questions Through Memorisation Or Generalisation?

Nov 21, 2023
Tim Hartill, Joshua Bensemann, Michael Witbrock, Patricia J. Riddle

A distinction is often drawn between a model's ability to predict a label for an evaluation sample that is directly memorised from highly similar training samples versus an ability to predict the label via some method of generalisation. In the context of using Language Models for question-answering, discussion continues to occur as to the extent to which questions are answered through memorisation. We consider this issue for questions that would ideally be answered through reasoning over an associated context. We propose a method of identifying evaluation samples for which it is very unlikely our model would have memorised the answers. Our method is based on semantic similarity of input tokens and label tokens between training and evaluation samples. We show that our method offers advantages upon some prior approaches in that it is able to surface evaluation-train pairs that have overlap in either contiguous or discontiguous sequences of tokens. We use this method to identify unmemorisable subsets of our evaluation datasets. We train two Language Models in a multitask fashion whereby the second model differs from the first only in that it has two additional datasets added to the training regime that are designed to impart simple numerical reasoning strategies of a sort known to improve performance on some of our evaluation datasets but not on others. We then show that there is performance improvement between the two models on the unmemorisable subsets of the evaluation datasets that were expected to benefit from the additional training datasets. Specifically, performance on unmemorisable subsets of two of our evaluation datasets, DROP and ROPES significantly improves by 9.0%, and 25.7% respectively while other evaluation datasets have no significant change in performance.

Viaarxiv icon

A Systematic Evaluation of Large Language Models on Out-of-Distribution Logical Reasoning Tasks

Oct 18, 2023
Qiming Bao, Gael Gendron, Alex Yuxuan Peng, Wanjun Zhong, Neset Tan, Yang Chen, Michael Witbrock, Jiamou Liu

Large language models (LLMs), such as GPT-3.5 and GPT-4, have greatly advanced the performance of artificial systems on various natural language processing tasks to human-like levels. However, their generalisation and robustness to perform logical reasoning remain under-evaluated. To probe this ability, we propose three new logical reasoning datasets named "ReClor-plus", "LogiQA-plus" and "LogiQAv2-plus", each featuring three subsets: the first with randomly shuffled options, the second with the correct choices replaced by "none of the other options are correct", and a combination of the previous two subsets. We carry out experiments on these datasets with both discriminative and generative LLMs and show that these simple tricks greatly hinder the performance of the language models. Despite their superior performance on the original publicly available datasets, we find that all models struggle to answer our newly constructed datasets. We show that introducing task variations by perturbing a sizable training set can markedly improve the model's generalisation and robustness in logical reasoning tasks. Moreover, applying logic-driven data augmentation for fine-tuning, combined with prompting can enhance the generalisation performance of both discriminative large language models and generative large language models. These results offer insights into assessing and improving the generalisation and robustness of large language models for logical reasoning tasks. We make our source code and data publicly available \url{https://github.com/Strong-AI-Lab/Logical-and-abstract-reasoning}.

* Accepted for oral presentation at the LLM@IJCAI 2023 non-archival symposium 
Viaarxiv icon

Exploring Self-Reinforcement for Improving Learnersourced Multiple-Choice Question Explanations with Large Language Models

Sep 19, 2023
Qiming Bao, Juho Leinonen, Alex Yuxuan Peng, Wanjun Zhong, Tim Pistotti, Alice Huang, Paul Denny, Michael Witbrock, Jiamou Liu

Figure 1 for Exploring Self-Reinforcement for Improving Learnersourced Multiple-Choice Question Explanations with Large Language Models
Figure 2 for Exploring Self-Reinforcement for Improving Learnersourced Multiple-Choice Question Explanations with Large Language Models
Figure 3 for Exploring Self-Reinforcement for Improving Learnersourced Multiple-Choice Question Explanations with Large Language Models
Figure 4 for Exploring Self-Reinforcement for Improving Learnersourced Multiple-Choice Question Explanations with Large Language Models

Learnersourcing involves students generating and sharing learning resources with their peers. When learnersourcing multiple-choice questions, creating explanations for the generated questions is a crucial step as it facilitates a deeper understanding of the related concepts. However, it is often difficult for students to craft effective explanations due to limited subject understanding and a tendency to merely restate the question stem, distractors, and correct answer. To help scaffold this task, in this work we propose a self-reinforcement large-language-model framework, with the goal of generating and evaluating explanations automatically. Comprising three modules, the framework generates student-aligned explanations, evaluates these explanations to ensure their quality and iteratively enhances the explanations. If an explanation's evaluation score falls below a defined threshold, the framework iteratively refines and reassesses the explanation. Importantly, our framework emulates the manner in which students compose explanations at the relevant grade level. For evaluation, we had a human subject-matter expert compare the explanations generated by students with the explanations created by the open-source large language model Vicuna-13B, a version of Vicuna-13B that had been fine-tuned using our method, and by GPT-4. We observed that, when compared to other large language models, GPT-4 exhibited a higher level of creativity in generating explanations. We also found that explanations generated by GPT-4 were ranked higher by the human expert than both those created by the other models and the original student-created explanations. Our findings represent a significant advancement in enriching the learnersourcing experience for students and enhancing the capabilities of large language models in educational applications.

* Preprint. Under review 
Viaarxiv icon

Teaching Smaller Language Models To Generalise To Unseen Compositional Questions

Aug 21, 2023
Tim Hartill, Neset Tan, Michael Witbrock, Patricia J. Riddle

Figure 1 for Teaching Smaller Language Models To Generalise To Unseen Compositional Questions
Figure 2 for Teaching Smaller Language Models To Generalise To Unseen Compositional Questions
Figure 3 for Teaching Smaller Language Models To Generalise To Unseen Compositional Questions
Figure 4 for Teaching Smaller Language Models To Generalise To Unseen Compositional Questions

We equip a smaller Language Model to generalise to answering challenging compositional questions that have not been seen in training. To do so we propose a combination of multitask supervised pretraining on up to 93 tasks designed to instill diverse reasoning abilities, and a dense retrieval system that aims to retrieve a set of evidential paragraph fragments. Recent progress in question-answering has been achieved either through prompting methods against very large pretrained Language Models in zero or few-shot fashion, or by fine-tuning smaller models, sometimes in conjunction with information retrieval. We focus on the less explored question of the extent to which zero-shot generalisation can be enabled in smaller models with retrieval against a corpus within which sufficient information to answer a particular question may not exist. We establish strong baselines in this setting for diverse evaluation datasets (StrategyQA, CommonsenseQA, IIRC, DROP, Musique and ARC-DA), and show that performance can be significantly improved by adding retrieval-augmented training datasets which are designed to expose our models to a variety of heuristic reasoning strategies such as weighing partial evidence or ignoring an irrelevant context.

Viaarxiv icon

Answering Unseen Questions With Smaller Language Models Using Rationale Generation and Dense Retrieval

Aug 12, 2023
Tim Hartill, Diana Benavides-Prado, Michael Witbrock, Patricia J. Riddle

Figure 1 for Answering Unseen Questions With Smaller Language Models Using Rationale Generation and Dense Retrieval
Figure 2 for Answering Unseen Questions With Smaller Language Models Using Rationale Generation and Dense Retrieval
Figure 3 for Answering Unseen Questions With Smaller Language Models Using Rationale Generation and Dense Retrieval
Figure 4 for Answering Unseen Questions With Smaller Language Models Using Rationale Generation and Dense Retrieval

When provided with sufficient explanatory context, smaller Language Models have been shown to exhibit strong reasoning ability on challenging short-answer question-answering tasks where the questions are unseen in training. We evaluate two methods for further improvement in this setting. Both methods focus on combining rationales generated by a larger Language Model with longer contexts created from a multi-hop dense retrieval system. The first method ($\textit{RR}$) involves training a Rationale Ranking model to score both generated rationales and retrieved contexts with respect to relevance and truthfulness. We then use the scores to derive combined contexts from both knowledge sources using a number of combinatory strategies. For the second method ($\textit{RATD}$) we train a smaller Reasoning model using retrieval-augmented training datasets such that it becomes proficient at utilising relevant information from longer text sequences that may be only partially evidential and frequently contain many irrelevant sentences. Generally we find that both methods are effective but that the $\textit{RATD}$ method is more straightforward to apply and produces the strongest results in the unseen setting on which we focus. Our single best Reasoning model using only 440 million parameters materially improves upon strong comparable prior baselines for unseen evaluation datasets (StrategyQA 58.9 $\rightarrow$ 61.7 acc., CommonsenseQA 63.6 $\rightarrow$ 72.7 acc., ARC-DA 31.6 $\rightarrow$ 52.1 F1, IIRC 25.5 $\rightarrow$ 27.3 F1) and a version utilising our prior knowledge of each type of question in selecting a context combination strategy does even better. Our proposed models also generally outperform direct prompts against much larger models (BLOOM 175B and StableVicuna 13B) in both few-shot chain-of-thought and few-shot answer-only settings.

Viaarxiv icon

Answering Unseen Questions With Smaller Language\\Models Using Rationale Generation and Dense Retrieval

Aug 09, 2023
Tim Hartill, Diana Benavides-Prado, Michael Witbrock, Patricia J. Riddle

Figure 1 for Answering Unseen Questions With Smaller Language\\Models Using Rationale Generation and Dense Retrieval
Figure 2 for Answering Unseen Questions With Smaller Language\\Models Using Rationale Generation and Dense Retrieval
Figure 3 for Answering Unseen Questions With Smaller Language\\Models Using Rationale Generation and Dense Retrieval
Figure 4 for Answering Unseen Questions With Smaller Language\\Models Using Rationale Generation and Dense Retrieval

When provided with sufficient explanatory context, smaller Language Models have been shown to exhibit strong reasoning ability on challenging short-answer question-answering tasks where the questions are unseen in training. We evaluate two methods for further improvement in this setting. Both methods focus on combining rationales generated by a larger Language Model with longer contexts created from a multi-hop dense retrieval system. The first method ($\textit{RR}$) involves training a Rationale Ranking model to score both generated rationales and retrieved contexts with respect to relevance and truthfulness. We then use the scores to derive combined contexts from both knowledge sources using a number of combinatory strategies. For the second method ($\textit{RATD}$) we train a smaller Reasoning model using retrieval-augmented training datasets such that it becomes proficient at utilising relevant information from longer text sequences that may be only partially evidential and frequently contain many irrelevant sentences. Generally we find that both methods are effective but that the $\textit{RATD}$ method is more straightforward to apply and produces the strongest results in the unseen setting on which we focus. Our single best Reasoning model using only 440 million parameters materially improves upon strong comparable prior baselines for unseen evaluation datasets (StrategyQA 58.9 $\rightarrow$ 61.7 acc., CommonsenseQA 63.6 $\rightarrow$ 72.7 acc., ARC-DA 31.6 $\rightarrow$ 52.1 F1, IIRC 25.5 $\rightarrow$ 27.3 F1) and a version utilising our prior knowledge of each type of question in selecting a context combination strategy does even better. Our proposed models also generally outperform direct prompts against much larger models (BLOOM 175B and StableVicuna 13B) in both few-shot chain-of-thought and few-shot answer-only settings.

Viaarxiv icon

Meerkat Behaviour Recognition Dataset

Jun 20, 2023
Mitchell Rogers, Gaël Gendron, David Arturo Soriano Valdez, Mihailo Azhar, Yang Chen, Shahrokh Heidari, Caleb Perelini, Padriac O'Leary, Kobe Knowles, Izak Tait, Simon Eyre, Michael Witbrock, Patrice Delmas

Figure 1 for Meerkat Behaviour Recognition Dataset
Figure 2 for Meerkat Behaviour Recognition Dataset
Figure 3 for Meerkat Behaviour Recognition Dataset

Recording animal behaviour is an important step in evaluating the well-being of animals and further understanding the natural world. Current methods for documenting animal behaviour within a zoo setting, such as scan sampling, require excessive human effort, are unfit for around-the-clock monitoring, and may produce human-biased results. Several animal datasets already exist that focus predominantly on wildlife interactions, with some extending to action or behaviour recognition. However, there is limited data in a zoo setting or data focusing on the group behaviours of social animals. We introduce a large meerkat (Suricata Suricatta) behaviour recognition video dataset with diverse annotated behaviours, including group social interactions, tracking of individuals within the camera view, skewed class distribution, and varying illumination conditions. This dataset includes videos from two positions within the meerkat enclosure at the Wellington Zoo (Wellington, New Zealand), with 848,400 annotated frames across 20 videos and 15 unannotated videos.

* Presented as a poster for the CV4Animals Workshop, CVPR 2023. For associated dataset see: https://meerkat-dataset.github.io/ 
Viaarxiv icon

Large Language Models Are Not Abstract Reasoners

May 31, 2023
Gaël Gendron, Qiming Bao, Michael Witbrock, Gillian Dobbie

Figure 1 for Large Language Models Are Not Abstract Reasoners
Figure 2 for Large Language Models Are Not Abstract Reasoners
Figure 3 for Large Language Models Are Not Abstract Reasoners
Figure 4 for Large Language Models Are Not Abstract Reasoners

Large Language Models have shown tremendous performance on a large variety of natural language processing tasks, ranging from text comprehension to common sense reasoning. However, the mechanisms responsible for this success remain unknown, and it is unclear whether LLMs can achieve human-like cognitive capabilities or whether these models are still fundamentally limited. Abstract reasoning is a fundamental task for cognition, consisting of finding and applying a general pattern from few data. Evaluating deep neural architectures on this task could give insight into their potential limitations regarding reasoning and their broad generalisation abilities, yet this is currently an under-explored area. In this paper, we perform extensive evaluations of state-of-the-art LLMs on abstract reasoning tasks, showing that they achieve very limited performance in contrast with other natural language tasks, and we investigate the reasons for this difference. We apply techniques that have been shown to improve performance on other NLP tasks and show that in most cases their impact on abstract reasoning performance is limited. In the course of this work, we have generated a new benchmark for evaluating language models on abstract reasoning tasks.

* 33 pages, 13 pages for the main paper and 20 pages for the supplement, 26 figures 
Viaarxiv icon

Contrastive Learning with Logic-driven Data Augmentation for Logical Reasoning over Text

May 21, 2023
Qiming Bao, Alex Yuxuan Peng, Zhenyun Deng, Wanjun Zhong, Neset Tan, Nathan Young, Yang Chen, Yonghua Zhu, Michael Witbrock, Jiamou Liu

Figure 1 for Contrastive Learning with Logic-driven Data Augmentation for Logical Reasoning over Text
Figure 2 for Contrastive Learning with Logic-driven Data Augmentation for Logical Reasoning over Text
Figure 3 for Contrastive Learning with Logic-driven Data Augmentation for Logical Reasoning over Text
Figure 4 for Contrastive Learning with Logic-driven Data Augmentation for Logical Reasoning over Text

Pre-trained large language model (LLM) is under exploration to perform NLP tasks that may require logical reasoning. Logic-driven data augmentation for representation learning has been shown to improve the performance of tasks requiring logical reasoning, but most of these data rely on designed templates and therefore lack generalization. In this regard, we propose an AMR-based logical equivalence-driven data augmentation method (AMR-LE) for generating logically equivalent data. Specifically, we first parse a text into the form of an AMR graph, next apply four logical equivalence laws (contraposition, double negation, commutative and implication laws) on the AMR graph to construct a logically equivalent/inequivalent AMR graph, and then convert it into a logically equivalent/inequivalent sentence. To help the model to better learn these logical equivalence laws, we propose a logical equivalence-driven contrastive learning training paradigm, which aims to distinguish the difference between logical equivalence and inequivalence. Our AMR-LE (Ensemble) achieves #2 on the ReClor leaderboard https://eval.ai/web/challenges/challenge-page/503/leaderboard/1347 . Our model shows better performance on seven downstream tasks, including ReClor, LogiQA, MNLI, MRPC, RTE, QNLI, and QQP. The source code and dataset are public at https://github.com/Strong-AI-Lab/Logical-Equivalence-driven-AMR-Data-Augmentation-for-Representation-Learning .

Viaarxiv icon