Topic:Mlsum It

On the State of German Text Summarization

Jan 17, 2023

Dennis Aumiller, Jing Fan, Michael Gertz

Abstract: With recent advancements in the area of Natural Language Processing, the focus is slowly shifting from a purely English-centric view towards more language-specific solutions, including German. Especially practical for businesses to analyze their growing amount of textual data are text summarization systems, which transform long input documents into compressed and more digestible summary texts. In this work, we assess the particular landscape of German abstractive text summarization and investigate the reasons why practically useful solutions for abstractive text summarization are still absent in industry. Our focus is two-fold, analyzing a) training resources, and b) publicly available summarization systems. We are able to show that popular existing datasets exhibit crucial flaws in their assumptions about the original sources, which frequently lead to detrimental effects on system generalization and evaluation biases. We confirm that for the most popular training dataset, MLSUM, over 50% of the training set is unsuitable for abstractive summarization purposes. Furthermore, available systems frequently fail to compare to simple baselines, and ignore more effective and efficient extractive summarization approaches. We attribute poor evaluation quality to a variety of different factors, which are investigated in more detail in this work: a lack of qualitative (and diverse) gold data considered for training, understudied (and untreated) positional biases in some of the existing datasets, and the lack of easily accessible and streamlined pre-processing strategies or analysis tools. We provide a comprehensive assessment of available models on the cleaned datasets, and find that this can lead to a reduction of more than 20 ROUGE-1 points during evaluation. The code for dataset filtering and reproducing results can be found online at https://github.com/dennlinger/summaries

* Accepted at the 20th Conference on Database Systems for Business, Technology and Web (BTW'23)
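
The abstract above notes that systems often are not compared against simple baselines. As a rough illustration of what such a baseline can look like, the sketch below builds a lead-k summary (the first k sentences of the article) and scores it with a simplified unigram-overlap F1 in the spirit of ROUGE-1. This is not the authors' code: the function names `split_sentences`, `lead_k`, and `unigram_f1` are hypothetical, the sentence splitter is deliberately naive, and the score omits stemming and other details of official ROUGE implementations.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercased word tokens; \w also matches German umlauts in Python 3.
    return re.findall(r"\w+", text.lower())


def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def lead_k(document, k=3):
    # Lead-k baseline: take the first k sentences as the summary.
    return " ".join(split_sentences(document)[:k])


def unigram_f1(candidate, reference):
    # Simplified ROUGE-1-style F1 based on clipped unigram overlap.
    cand = Counter(tokenize(candidate))
    ref = Counter(tokenize(reference))
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Scoring `lead_k(article)` against the gold summary for every test article gives a reference number that any trained system would be expected to beat.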

Neural Label Search for Zero-Shot Multi-Lingual Extractive Summarization

Apr 29, 2022

Ruipeng Jia, Xingxing Zhang, Yanan Cao, Shi Wang, Zheng Lin, Furu Wei

Abstract: In zero-shot multilingual extractive text summarization, a model is typically trained on an English summarization dataset and then applied to summarization datasets in other languages. Given English gold summaries and documents, sentence-level labels for extractive summarization are usually generated using heuristics. However, these monolingual labels created on English datasets may not be optimal for datasets in other languages, because of syntactic or semantic discrepancies between languages. To address this, the English dataset can be translated into other languages to obtain different sets of labels, again using heuristics. To fully leverage the information in these different sets of labels, we propose NLSSum (Neural Label Search for Summarization), which jointly learns hierarchical weights for these different sets of labels together with our summarization model. We conduct multilingual zero-shot summarization experiments on the MLSUM and WikiLingua datasets, and we achieve state-of-the-art results in both human and automatic evaluations across these two datasets.
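
The abstract mentions that sentence-level labels for extractive summarization are usually generated using heuristics, without saying which one. A commonly used heuristic is greedy selection: repeatedly add the sentence that most improves overlap with the gold summary, and stop when no sentence helps. The sketch below illustrates that idea under stated assumptions; it is not taken from the NLSSum paper, the names `greedy_labels` and `unigram_f1` are hypothetical, and a plain unigram F1 stands in for the ROUGE variants typically used.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercased word tokens.
    return re.findall(r"\w+", text.lower())


def unigram_f1(cand_tokens, ref_tokens):
    # Simplified ROUGE-1-style F1 over token lists (assumption: no stemming).
    cand, ref = Counter(cand_tokens), Counter(ref_tokens)
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def greedy_labels(sentences, summary, max_sentences=3):
    # Returns one 0/1 label per sentence: 1 if the greedy search selected it.
    ref = tokenize(summary)
    selected = []
    best = 0.0
    while len(selected) < max_sentences:
        best_idx = None
        for i in range(len(sentences)):
            if i in selected:
                continue
            # Score the already selected sentences plus candidate i, in document order.
            cand = [t for j in sorted(selected + [i]) for t in tokenize(sentences[j])]
            score = unigram_f1(cand, ref)
            if score > best:
                best, best_idx = score, i
        if best_idx is None:
            break  # No remaining sentence improves the score.
        selected.append(best_idx)
    return [1 if i in selected else 0 for i in range(len(sentences))]
```

Running the same procedure on a translated copy of the dataset can yield a different label set for the same document, which is the kind of disagreement between label sets that the abstract describes.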

MLSUM: The Multilingual Summarization Corpus

Apr 30, 2020

Thomas Scialom, Paul-Alexis Dray, Sylvain Lamprier, Benjamin Piwowarski, Jacopo Staiano

Abstract: We present MLSUM, the first large-scale MultiLingual SUMmarization dataset. Obtained from online newspapers, it contains 1.5M+ article/summary pairs in five different languages, namely French, German, Spanish, Russian, and Turkish. Together with English newspapers from the popular CNN/Daily Mail dataset, the collected data form a large-scale multilingual dataset which can enable new research directions for the text summarization community. We report cross-lingual comparative analyses based on state-of-the-art systems. These highlight existing biases which motivate the use of a multi-lingual dataset.
