Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Shay B. Cohen

AMR Parsing is Far from Solved: GrAPES, the Granular AMR Parsing Evaluation Suite

Dec 06, 2023

Jonas Groschwitz, Shay B. Cohen, Lucia Donatelli, Meaghan Fowlie

Abstract:We present the Granular AMR Parsing Evaluation Suite (GrAPES), a challenge set for Abstract Meaning Representation (AMR) parsing with accompanying evaluation metrics. AMR parsers now obtain high scores on the standard AMR evaluation metric Smatch, close to or even above reported inter-annotator agreement. But that does not mean that AMR parsing is solved; in fact, human evaluation in previous work indicates that current parsers still quite frequently make errors on node labels or graph structure that substantially distort sentence meaning. Here, we provide an evaluation suite that tests AMR parsers on a range of phenomena of practical, technical, and linguistic interest. Our 36 categories range from seen and unseen labels, to structural generalization, to coreference. GrAPES reveals in depth the abilities and shortcomings of current AMR parsers.

* Accepted at EMNLP 2023. For the associated GitHub repository, see https://github.com/jgroschwitz/GrAPES

Via

Access Paper or Ask Questions

Think While You Write: Hypothesis Verification Promotes Faithful Knowledge-to-Text Generation

Nov 16, 2023

Yifu Qiu, Varun Embar, Shay B. Cohen, Benjamin Han

Figure 1 for Think While You Write: Hypothesis Verification Promotes Faithful Knowledge-to-Text Generation

Figure 2 for Think While You Write: Hypothesis Verification Promotes Faithful Knowledge-to-Text Generation

Figure 3 for Think While You Write: Hypothesis Verification Promotes Faithful Knowledge-to-Text Generation

Figure 4 for Think While You Write: Hypothesis Verification Promotes Faithful Knowledge-to-Text Generation

Abstract:Neural knowledge-to-text generation models often struggle to faithfully generate descriptions for the input facts: they may produce hallucinations that contradict the given facts, or describe facts not present in the input. To reduce hallucinations, we propose a novel decoding method, TWEAK (Think While Effectively Articulating Knowledge). TWEAK treats the generated sequences at each decoding step and its future sequences as hypotheses, and ranks each generation candidate based on how well their corresponding hypotheses support the input facts using a Hypothesis Verification Model (HVM). We first demonstrate the effectiveness of TWEAK by using a Natural Language Inference (NLI) model as the HVM and report improved faithfulness with minimal impact on the quality. We then replace the NLI model with our task-specific HVM trained with a first-of-a-kind dataset, FATE (Fact-Aligned Textual Entailment), which pairs input facts with their faithful and hallucinated descriptions with the hallucinated spans marked. The new HVM improves the faithfulness and the quality further and runs faster. Overall the best TWEAK variants improve on average 2.22/7.17 points on faithfulness measured by FactKB over WebNLG and TekGen/GenWiki, respectively, with only 0.14/0.32 points degradation on quality measured by BERTScore over the same datasets. Since TWEAK is a decoding-only approach, it can be integrated with any neural generative model without retraining.

Via

Access Paper or Ask Questions

Are Large Language Models Temporally Grounded?

Nov 16, 2023

Yifu Qiu, Zheng Zhao, Yftah Ziser, Anna Korhonen, Edoardo M. Ponti, Shay B. Cohen

Abstract:Are Large language models (LLMs) temporally grounded? Since LLMs cannot perceive and interact with the environment, it is impossible to answer this question directly. Instead, we provide LLMs with textual narratives and probe them with respect to their common-sense knowledge of the structure and duration of events, their ability to order events along a timeline, and self-consistency within their temporal model (e.g., temporal relations such as after and before are mutually exclusive for any pair of events). We evaluate state-of-the-art LLMs (such as LLaMA 2 and GPT-4) on three tasks reflecting these abilities. Generally, we find that LLMs lag significantly behind both human performance as well as small-scale, specialised LMs. In-context learning, instruction tuning, and chain-of-thought prompting reduce this gap only to a limited degree. Crucially, LLMs struggle the most with self-consistency, displaying incoherent behaviour in at least 27.23% of their predictions. Contrary to expectations, we also find that scaling the model size does not guarantee positive gains in performance. To explain these results, we study the sources from which LLMs may gather temporal information: we find that sentence ordering in unlabelled texts, available during pre-training, is only weakly correlated with event ordering. Moreover, public instruction tuning mixtures contain few temporal tasks. Hence, we conclude that current LLMs lack a consistent temporal model of textual narratives. Code, datasets, and LLM outputs are available at https://github.com/yfqiu-nlp/temporal-llms.

Via

Access Paper or Ask Questions

Can Large Language Models Follow Concept Annotation Guidelines? A Case Study on Scientific and Financial Domains

Nov 15, 2023

Marcio Fonseca, Shay B. Cohen

Figure 1 for Can Large Language Models Follow Concept Annotation Guidelines? A Case Study on Scientific and Financial Domains

Figure 2 for Can Large Language Models Follow Concept Annotation Guidelines? A Case Study on Scientific and Financial Domains

Figure 3 for Can Large Language Models Follow Concept Annotation Guidelines? A Case Study on Scientific and Financial Domains

Figure 4 for Can Large Language Models Follow Concept Annotation Guidelines? A Case Study on Scientific and Financial Domains

Abstract:Although large language models (LLMs) exhibit remarkable capacity to leverage in-context demonstrations, it is still unclear to what extent they can learn new concepts or facts from ground-truth labels. To address this question, we examine the capacity of instruction-tuned LLMs to follow in-context concept guidelines for sentence labeling tasks. We design guidelines that present different types of factual and counterfactual concept definitions, which are used as prompts for zero-shot sentence classification tasks. Our results show that although concept definitions consistently help in task performance, only the larger models (with 70B parameters or more) have limited ability to work under counterfactual contexts. Importantly, only proprietary models such as GPT-3.5 and GPT-4 can recognize nonsensical guidelines, which we hypothesize is due to more sophisticated alignment methods. Finally, we find that Falcon-180B-chat is outperformed by Llama-2-70B-chat is most cases, which indicates that careful fine-tuning is more effective than increasing model scale. Altogether, our simple evaluation method reveals significant gaps in concept understanding between the most capable open-source language models and the leading proprietary APIs.

Via

Access Paper or Ask Questions

A Joint Matrix Factorization Analysis of Multilingual Representations

Oct 24, 2023

Zheng Zhao, Yftah Ziser, Bonnie Webber, Shay B. Cohen

Abstract:We present an analysis tool based on joint matrix factorization for comparing latent representations of multilingual and monolingual models. An alternative to probing, this tool allows us to analyze multiple sets of representations in a joint manner. Using this tool, we study to what extent and how morphosyntactic features are reflected in the representations learned by multilingual pre-trained models. We conduct a large-scale empirical study of over 33 languages and 17 morphosyntactic categories. Our findings demonstrate variations in the encoding of morphosyntactic information across upper and lower layers, with category-specific differences influenced by language properties. Hierarchical clustering of the factorization outputs yields a tree structure that is related to phylogenetic trees manually crafted by linguists. Moreover, we find the factorization outputs exhibit strong associations with performance observed across different cross-lingual tasks. We release our code to facilitate future research.

* Accepted to Findings of EMNLP 2023

Via

Access Paper or Ask Questions

Knowledge Base Question Answering for Space Debris Queries

May 31, 2023

Paul Darm, Antonio Valerio Miceli-Barone, Shay B. Cohen, Annalisa Riccardi

Figure 1 for Knowledge Base Question Answering for Space Debris Queries

Figure 2 for Knowledge Base Question Answering for Space Debris Queries

Figure 3 for Knowledge Base Question Answering for Space Debris Queries

Figure 4 for Knowledge Base Question Answering for Space Debris Queries

Abstract:Space agencies execute complex satellite operations that need to be supported by the technical knowledge contained in their extensive information systems. Knowledge bases (KB) are an effective way of storing and accessing such information at scale. In this work we present a system, developed for the European Space Agency (ESA), that can answer complex natural language queries, to support engineers in accessing the information contained in a KB that models the orbital space debris environment. Our system is based on a pipeline which first generates a sequence of basic database operations, called a %program sketch, from a natural language question, then specializes the sketch into a concrete query program with mentions of entities, attributes and relations, and finally executes the program against the database. This pipeline decomposition approach enables us to train the system by leveraging out-of-domain data and semi-synthetic data generated by GPT-3, thus reducing overfitting and shortcut learning even with limited amount of in-domain training data. Our code can be found at \url{https://github.com/PaulDrm/DISCOSQA}.

* 7 pages, ACL 2023 industry track

Via

Access Paper or Ask Questions

Sentence-Incremental Neural Coreference Resolution

May 26, 2023

Matt Grenander, Shay B. Cohen, Mark Steedman

Figure 1 for Sentence-Incremental Neural Coreference Resolution

Figure 2 for Sentence-Incremental Neural Coreference Resolution

Figure 3 for Sentence-Incremental Neural Coreference Resolution

Figure 4 for Sentence-Incremental Neural Coreference Resolution

Abstract:We propose a sentence-incremental neural coreference resolution system which incrementally builds clusters after marking mention boundaries in a shift-reduce method. The system is aimed at bridging two recent approaches at coreference resolution: (1) state-of-the-art non-incremental models that incur quadratic complexity in document length with high computational cost, and (2) memory network-based models which operate incrementally but do not generalize beyond pronouns. For comparison, we simulate an incremental setting by constraining non-incremental systems to form partial coreference chains before observing new sentences. In this setting, our system outperforms comparable state-of-the-art methods by 2 F1 on OntoNotes and 7 F1 on the CODI-CRAC 2021 corpus. In a conventional coreference setup, our system achieves 76.3 F1 on OntoNotes and 45.8 F1 on CODI-CRAC 2021, which is comparable to state-of-the-art baselines. We also analyze variations of our system and show that the degree of incrementality in the encoder has a surprisingly large effect on the resulting performance.

* Accepted at EMNLP 2022

Via

Access Paper or Ask Questions

The Larger They Are, the Harder They Fail: Language Models do not Recognize Identifier Swaps in Python

May 24, 2023

Antonio Valerio Miceli-Barone, Fazl Barez, Ioannis Konstas, Shay B. Cohen

Figure 1 for The Larger They Are, the Harder They Fail: Language Models do not Recognize Identifier Swaps in Python

Figure 2 for The Larger They Are, the Harder They Fail: Language Models do not Recognize Identifier Swaps in Python

Figure 3 for The Larger They Are, the Harder They Fail: Language Models do not Recognize Identifier Swaps in Python

Figure 4 for The Larger They Are, the Harder They Fail: Language Models do not Recognize Identifier Swaps in Python

Abstract:Large Language Models (LLMs) have successfully been applied to code generation tasks, raising the question of how well these models understand programming. Typical programming languages have invariances and equivariances in their semantics that human programmers intuitively understand and exploit, such as the (near) invariance to the renaming of identifiers. We show that LLMs not only fail to properly generate correct Python code when default function names are swapped, but some of them even become more confident in their incorrect predictions as the model size increases, an instance of the recently discovered phenomenon of Inverse Scaling, which runs contrary to the commonly observed trend of increasing prediction quality with increasing model size. Our findings indicate that, despite their astonishing typical-case performance, LLMs still lack a deep, abstract understanding of the content they manipulate, making them unsuitable for tasks that statistically deviate from their training data, and that mere scaling is not enough to achieve such capability.

* 17 pages, 5 figure, ACL 2023

Via

Access Paper or Ask Questions

Detecting and Mitigating Hallucinations in Multilingual Summarisation

May 23, 2023

Yifu Qiu, Yftah Ziser, Anna Korhonen, Edoardo M. Ponti, Shay B. Cohen

Abstract:Hallucinations pose a significant challenge to the reliability of neural models for abstractive summarisation. While automatically generated summaries may be fluent, they often lack faithfulness to the original document. This issue becomes even more pronounced in low-resource settings, such as cross-lingual transfer. With the existing faithful metrics focusing on English, even measuring the extent of this phenomenon in cross-lingual settings is hard. To address this, we first develop a novel metric, mFACT, evaluating the faithfulness of non-English summaries, leveraging translation-based transfer from multiple English faithfulness metrics. We then propose a simple but effective method to reduce hallucinations with a cross-lingual transfer, which weighs the loss of each training example by its faithfulness score. Through extensive experiments in multiple languages, we demonstrate that mFACT is the metric that is most suited to detect hallucinations. Moreover, we find that our proposed loss weighting method drastically increases both performance and faithfulness according to both automatic and human evaluation when compared to strong baselines for cross-lingual transfer such as MAD-X. Our code and dataset are available at https://github.com/yfqiu-nlp/mfact-summ.

Via

Access Paper or Ask Questions

PMIndiaSum: Multilingual and Cross-lingual Headline Summarization for Languages in India

May 15, 2023

Ashok Urlana, Pinzhen Chen, Zheng Zhao, Shay B. Cohen, Manish Shrivastava, Barry Haddow

Abstract:This paper introduces PMIndiaSum, a new multilingual and massively parallel headline summarization corpus focused on languages in India. Our corpus covers four language families, 14 languages, and the largest to date, 196 language pairs. It provides a testing ground for all cross-lingual pairs. We detail our workflow to construct the corpus, including data acquisition, processing, and quality assurance. Furthermore, we publish benchmarks for monolingual, cross-lingual, and multilingual summarization by fine-tuning, prompting, as well as translate-and-summarize. Experimental results confirm the crucial role of our data in aiding the summarization of Indian texts. Our dataset is publicly available and can be freely modified and re-distributed.

Via

Access Paper or Ask Questions