Sotaro Takeshita

Towards Automated Survey Variable Search and Summarization in Social Science Publications

Sep 14, 2022
Yavuz Selim Kartal, Sotaro Takeshita, Tornike Tsereteli, Kai Eckert, Henning Kroll, Philipp Mayr, Simone Paolo Ponzetto, Benjamin Zapilko, Andrea Zielinski

Many scientific disciplines increasingly support researchers with enhanced information access by linking publications to their underlying datasets, providing infrastructure that improves the reproducibility and reusability of research results. In this research note, we present an overview of an ongoing research project, VADIS (VAriable Detection, Interlinking and Summarization), which aims to develop technology and infrastructure for enhanced information access in the Social Sciences via search and summarization of publications based on the automatic identification and indexing of survey variables in text. We describe the project's overarching vision, its main components, and the related challenges, and discuss in detail how these address the limitations of current information access systems for publications in the Social Sciences. We show how this goal can be concretely implemented in an end-user system by presenting a search prototype based on user requirements collected from qualitative interviews with empirical Social Science researchers.

* 10 pages, 2 figures 
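The core technical idea, linking text in publications to survey variables, can be framed as a semantic search problem over a variable catalogue. The snippet below is only a minimal illustrative sketch of that framing, not the VADIS pipeline itself; the encoder checkpoint, the toy variable catalogue, and the example sentence are all assumptions introduced for illustration.

```python
# Minimal sketch: rank survey variables for a sentence by embedding similarity.
# NOTE: illustrative only -- model name and variable catalogue are placeholders,
# not the project's actual components.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works here

# Hypothetical survey-variable catalogue: id -> human-readable description.
variables = {
    "var_trust_gov": "Trust in the national government",
    "var_income": "Monthly net household income",
    "var_edu_level": "Highest level of education completed",
}

# Sentence from a (hypothetical) social science publication.
sentence = ("Respondents with higher educational attainment reported "
            "greater trust in institutions.")

var_ids = list(variables)
var_embs = model.encode([variables[v] for v in var_ids], convert_to_tensor=True)
sent_emb = model.encode(sentence, convert_to_tensor=True)

# Cosine similarity between the sentence and every variable description.
scores = util.cos_sim(sent_emb, var_embs)[0]
for var_id, score in sorted(zip(var_ids, scores.tolist()),
                            key=lambda x: x[1], reverse=True):
    print(f"{var_id}\t{score:.3f}")
```

In an end-user search system, the same index of variable descriptions can be queried in the opposite direction, retrieving publications whose sentences score highly against a variable of interest.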

X-SCITLDR: Cross-Lingual Extreme Summarization of Scholarly Documents

May 30, 2022
Sotaro Takeshita, Tommaso Green, Niklas Friedrich, Kai Eckert, Simone Paolo Ponzetto

The number of scientific publications is rapidly increasing, causing information overload for researchers and making it hard for scholars to keep up with current trends and lines of work. Consequently, recent work on applying text mining technologies to scholarly publications has investigated automatic text summarization, including extreme summarization, for this domain. However, previous work has concentrated only on monolingual settings, primarily in English. In this paper, we fill this research gap and present an abstractive cross-lingual summarization dataset for four different languages in the scholarly domain, which enables us to train and evaluate models that process English papers and generate summaries in German, Italian, Chinese and Japanese. We present our new X-SCITLDR dataset for multilingual summarization and thoroughly benchmark different models based on a state-of-the-art multilingual pre-trained model, including a two-stage `summarize and translate' approach and a direct cross-lingual model. We additionally explore the benefits of intermediate-stage training using English monolingual summarization and machine translation as intermediate tasks, and analyze performance in zero- and few-shot scenarios.

* JCDL2022 
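The two-stage `summarize and translate' baseline mentioned in the abstract can be sketched as a simple pipeline: first summarize the English paper, then machine-translate the summary into the target language. The sketch below illustrates that structure with generic Hugging Face checkpoints; the models and the input text are placeholders, not those used in the paper.

```python
# Minimal sketch of the two-stage `summarize and translate' baseline:
# English paper text -> English TLDR -> German TLDR.
# Checkpoints are generic illustrations, not the paper's models.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")

paper_text = (
    "We present a new dataset for cross-lingual extreme summarization of "
    "scholarly documents and benchmark several multilingual models on it."
)

# Stage 1: monolingual English extreme summarization.
english_tldr = summarizer(paper_text, max_length=30, min_length=5,
                          do_sample=False)[0]["summary_text"]

# Stage 2: machine translation of the summary into the target language.
german_tldr = translator(english_tldr)[0]["translation_text"]
print(german_tldr)
```

A direct cross-lingual model instead fine-tunes a single multilingual sequence-to-sequence model to map the English input straight to a target-language summary, avoiding error propagation between the two stages.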

ZusammenQA: Data Augmentation with Specialized Models for Cross-lingual Open-retrieval Question Answering System

May 30, 2022
Chia-Chien Hung, Tommaso Green, Robert Litschko, Tornike Tsereteli, Sotaro Takeshita, Marco Bombieri, Goran Glavaš, Simone Paolo Ponzetto

This paper introduces our proposed system for the MIA Shared Task on Cross-lingual Open-retrieval Question Answering (COQA). In this challenging scenario, given an input question, the system has to gather evidence documents from a multilingual pool and generate an answer from them in the language of the question. We devised several approaches combining different model variants for three main components: Data Augmentation, Passage Retrieval, and Answer Generation. For passage retrieval, we evaluated the monolingual BM25 ranker against an ensemble of re-rankers based on multilingual pretrained language models (PLMs), as well as variants of the shared-task baseline, re-training it from scratch with a recently introduced contrastive loss that maintains a strong gradient signal throughout training by means of mixed negative samples. For answer generation, we focused on language- and domain-specialization by means of continued language model (LM) pretraining of existing multilingual encoders. Additionally, for both passage retrieval and answer generation, we augmented the training data provided by the task organizers with automatically generated question-answer pairs created from Wikipedia passages, mitigating data scarcity, particularly for the low-resource languages for which no training data were provided. Our results show that language- and domain-specialization as well as data augmentation help, especially for low-resource languages.
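The overall retrieve-then-generate structure described above can be illustrated with a minimal sketch: lexical passage retrieval followed by answer generation conditioned on the retrieved evidence. This is only a schematic under assumed components, not the shared-task system; the toy passage pool is invented, BM25 here uses the rank_bm25 package rather than the task's retrievers, and google/mt5-small merely shows the interface (a QA-fine-tuned multilingual checkpoint would be needed for usable answers).

```python
# Minimal retrieve-then-generate sketch (not the shared-task system).
# BM25 over a toy multilingual passage pool, then generation with a generic
# multilingual seq2seq model; checkpoint and passages are placeholders.
from rank_bm25 import BM25Okapi
from transformers import pipeline

passages = [
    "Berlin ist die Hauptstadt von Deutschland.",
    "Tokyo is the capital of Japan and its largest city.",
    "La Tour Eiffel se trouve à Paris.",
]
tokenized_corpus = [p.lower().split() for p in passages]
bm25 = BM25Okapi(tokenized_corpus)

question = "What is the capital of Japan?"
scores = bm25.get_scores(question.lower().split())
best_passage = passages[max(range(len(passages)), key=lambda i: scores[i])]

# Answer generation conditioned on the question and the retrieved evidence.
# An untuned mt5-small will not answer well; this only shows the call pattern.
generator = pipeline("text2text-generation", model="google/mt5-small")
prompt = f"question: {question} context: {best_passage}"
print(generator(prompt, max_length=32)[0]["generated_text"])
```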
