Stefan Dietze

Which Factors Drive Open Access Publishing? A Springer Nature Case Study

Aug 17, 2022

Fakhri Momeni, Stefan Dietze, Philipp Mayr, Kristin Biesenbender, Isabella Peters

Abstract: Open Access (OA) facilitates access to articles. However, authors or funders often must pay the publishing costs, which prevents authors who do not receive financial support from participating in OA publishing and from benefiting from the citation advantage of OA articles. OA may therefore exacerbate existing inequalities in the publication system rather than overcome them. To investigate this, we studied 522,664 articles published by Springer Nature. Employing statistical methods, we describe the relationship between authors affiliated with countries from different income levels, their choice of publishing (OA or closed access), and the citation impact of their papers. A machine learning classification method helped us to explore the association between OA publishing and attributes of the author (especially eligibility for APC waivers or discounts), journal, country, and paper. The results indicate that authors eligible for APC waivers publish more in gold-OA journals than other authors. In contrast, authors eligible for an APC discount have the lowest ratio of OA publications, leading to the assumption that this discount insufficiently motivates authors to publish in a gold-OA journal. The rank of journals is a significant driver for publishing in a gold-OA journal, whereas the OA option is mostly avoided in hybrid journals. Seniority, experience with OA publications, and the scientific field are the most decisive factors in OA publishing.

Still Haven't Found What You're Looking For -- Detecting the Intent of Web Search Missions from User Interaction Features

Jul 04, 2022

Ran Yu, Limock, Stefan Dietze

Abstract: Web search is among the most frequent online activities. Whereas traditional information retrieval techniques focus on the information need behind a user query, previous work has shown that user behaviour and interaction can provide important signals for understanding the underlying intent of a search mission. An established taxonomy distinguishes between transactional, navigational and informational search missions, where in particular the latter involve a learning goal, i.e. the intent to acquire knowledge about a particular topic. We introduce a supervised approach for classifying online search missions into one of these categories by utilising a range of features obtained from the user interactions during an online search mission. Applying our model to a dataset of real-world query logs, we show that search missions can be categorised with an average F1 score of 63% and accuracy of 69%, while performance on informational and navigational missions is particularly promising (F1>75%). This suggests the potential to utilise such supervised classification during online search to better facilitate retrieval and ranking as well as to improve affiliated services, such as targeted online ads.

SciTweets -- A Dataset and Annotation Framework for Detecting Scientific Online Discourse

Jun 15, 2022

Salim Hafid, Sebastian Schellhammer, Sandra Bringay, Konstantin Todorov, Stefan Dietze

Abstract: Scientific topics, claims and resources are increasingly debated as part of online discourse, where prominent examples include discourse related to COVID-19 or climate change. This has led to both significant societal impact and increased interest in scientific online discourse from various disciplines. For instance, communication studies aim at a deeper understanding of biases, quality or spreading patterns of scientific information, whereas computational methods have been proposed to extract, classify or verify scientific claims using NLP and IR techniques. However, research across disciplines currently suffers from both a lack of robust definitions of the various forms of science-relatedness and a lack of appropriate ground truth data for distinguishing them. In this work, we contribute (a) an annotation framework and corresponding definitions for different forms of scientific relatedness of online discourse in tweets, (b) an expert-annotated dataset of 1261 tweets obtained through our labeling framework, reaching an average Fleiss' kappa $\kappa$ of 0.63, and (c) a multi-label classifier trained on our data that is able to detect science-relatedness with 89% F1 and also to detect distinct forms of scientific knowledge (claims, references). With this work we aim to lay the foundation for developing and evaluating robust methods for analysing science as part of large-scale online discourse.

* submitted to CIKM 2022
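
The agreement figure reported in the abstract above can be made concrete with a small worked computation. The following is a minimal sketch of the standard Fleiss' kappa formula, not the authors' code. The function name `fleiss_kappa` and the input layout are assumptions made for illustration, and the sketch assumes every item is rated by the same number of raters.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of
    raters who assigned item i to category j.

    Assumes each row sums to the same number of raters (at least 2).
    Raises ZeroDivisionError when all ratings fall into one category,
    because chance agreement is then 1 and kappa is undefined.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])

    # Observed agreement: for each item, the share of rater pairs that agree.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement: sum of squared overall category proportions.
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(p * p for p in proportions)

    return (p_observed - p_expected) / (1 - p_expected)
```

Values near 1 indicate agreement well above chance, values near 0 indicate agreement at chance level, and negative values indicate agreement below chance.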

SaL-Lightning Dataset: Search and Eye Gaze Behavior, Resource Interactions and Knowledge Gain during Web Search

Jan 07, 2022

Christian Otto, Markus Rokicki, Georg Pardi, Wolfgang Gritz, Daniel Hienert, Ran Yu, Johannes von Hoyer, Anett Hoppe, Stefan Dietze, Peter Holtz(+2 more)

Abstract: The emerging research field Search as Learning (SAL) investigates how the Web facilitates learning through modern information retrieval systems. SAL research requires significant amounts of data that capture both search behavior of users and their acquired knowledge in order to obtain conclusive insights or train supervised machine learning models. However, the creation of such datasets is costly and requires interdisciplinary efforts in order to design studies and capture a wide range of features. In this paper, we address this issue and introduce an extensive dataset based on a user study, in which 114 participants were asked to learn about the formation of lightning and thunder. Participants' knowledge states were measured before and after Web search through multiple-choice questionnaires and essay-based free recall tasks. To enable future research in SAL-related tasks we recorded a plethora of features and person-related attributes. Besides the screen recordings, visited Web pages, and detailed browsing histories, a large number of behavioral features and resource features were monitored. We underline the usefulness of the dataset by describing three already published use cases.

* To be published at the 2022 ACM SIGIR Conference on Human Information Interaction and Retrieval (CHIIR '22)

SoMeSci- A 5 Star Open Data Gold Standard Knowledge Graph of Software Mentions in Scientific Articles

Aug 20, 2021

David Schindler, Felix Bensmann, Stefan Dietze, Frank Krüger

Abstract: Knowledge about software used in scientific investigations is important for several reasons, for instance, to enable an understanding of provenance and methods involved in data handling. However, software is usually not formally cited, but rather mentioned informally within the scholarly description of the investigation, raising the need for automatic information extraction and disambiguation. Given the lack of reliable ground truth data, we present SoMeSci (Software Mentions in Science), a gold standard knowledge graph of software mentions in scientific articles. It contains high quality annotations (IRR: $\kappa = 0.82$) of 3756 software mentions in 1367 PubMed Central articles. Besides the plain mention of the software, we also provide relation labels for additional information, such as the version, the developer, a URL or citations. Moreover, we distinguish between different types, such as application, plugin or programming environment, as well as different types of mentions, such as usage or creation. To the best of our knowledge, SoMeSci is the most comprehensive corpus about software mentions in scientific articles, providing training samples for Named Entity Recognition, Relation Extraction, Entity Disambiguation, and Entity Linking. Finally, we sketch potential use cases and provide baseline results.

* Preprint of CIKM 2021 Resource Paper, 10 pages

Predicting Knowledge Gain during Web Search based on Multimedia Resource Consumption

Jun 11, 2021

Christian Otto, Ran Yu, Georg Pardi, Johannes von Hoyer, Markus Rokicki, Anett Hoppe, Peter Holtz, Yvonne Kammerer, Stefan Dietze, Ralph Ewerth

Abstract: In informal learning scenarios the popularity of multimedia content, such as video tutorials or lectures, has significantly increased. Yet, the users' interactions, navigation behavior, and consequently learning outcome, have not been researched extensively. Related work in this field, also called search as learning, has focused on behavioral or text resource features to predict learning outcome and knowledge gain. In this paper, we investigate whether we can exploit features representing multimedia resource consumption to predict knowledge gain (KG) during Web search from in-session data, that is, without prior knowledge about the learner. For this purpose, we suggest a set of multimedia features related to image and video consumption. Our feature extraction is evaluated in a lab study with 113 participants where we collected data for a given search as learning task on the formation of thunderstorms and lightning. We automatically analyze the monitored log data and utilize state-of-the-art computer vision methods to extract features about the seen multimedia resources. Experimental results demonstrate that multimedia features can improve KG prediction. Finally, we provide an analysis on feature importance (text and multimedia) for KG prediction.

* 13 pages, 2 figures, 2 tables

Better Together -- An Ensemble Learner for Combining the Results of Ready-made Entity Linking Systems

Jan 14, 2021

Renato Stoffalette João, Pavlos Fafalios, Stefan Dietze

Abstract: Entity linking (EL) is the task of automatically identifying entity mentions in text and resolving them to a corresponding entity in a reference knowledge base like Wikipedia. Throughout the past decade, a plethora of EL systems and pipelines have become available, where performance of individual systems varies heavily across corpora, languages or domains. Linking performance varies even between different mentions in the same text corpus, where, for instance, some EL approaches are better able to deal with short surface forms while others may perform better when more context information is available. To this end, we argue that performance may be optimised by exploiting results from distinct EL systems on the same corpus, thereby leveraging their individual strengths on a per-mention basis. In this paper, we introduce a supervised approach which exploits the output of multiple ready-made EL systems by predicting the correct link on a per-mention basis. Experimental results obtained on existing ground truth datasets and exploiting three state-of-the-art EL systems show the effectiveness of our approach and its capacity to significantly outperform the individual EL systems as well as a set of baseline methods.

* SAC '20: Proceedings of the 35th Annual ACM Symposium on Applied Computing

Exploiting stance hierarchies for cost-sensitive stance detection of Web documents

Jul 29, 2020

Arjun Roy, Pavlos Fafalios, Asif Ekbal, Xiaofei Zhu, Stefan Dietze

Abstract: Fact checking is an essential challenge when combating fake news. Identifying documents that agree or disagree with a particular statement (claim) is a core task in this process. In this context, stance detection aims at identifying the position (stance) of a document towards a claim. Most approaches address this task through a 4-class classification model where the class distribution is highly imbalanced. Therefore, they are particularly ineffective in detecting the minority classes (for instance, 'disagree'), even though such instances are crucial for tasks such as fact-checking by providing evidence for detecting false claims. In this paper, we exploit the hierarchical nature of stance classes, which allows us to propose a modular pipeline of cascading binary classifiers, enabling performance tuning on a per-step and per-class basis. We implement our approach through a combination of neural and traditional classification models that highlight the misclassification costs of minority classes. Evaluation results demonstrate state-of-the-art performance of our approach and its ability to significantly improve the classification performance of the important 'disagree' class.

* 10 pages; 4 figures
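
The cascade of binary classifiers described in the abstract above can be pictured as a chain of yes/no decisions. The sketch below is illustrative only and is not the authors' pipeline. The class names `unrelated`, `discuss` and `agree`, the order of the three steps, and the keyword-based `toy_*` predicates are all assumptions; the abstract names only the `disagree` class. In a real system each predicate would be a trained binary classifier that can be tuned separately.

```python
from typing import Callable

# A binary decision over a (claim, document) pair.
Predicate = Callable[[str, str], bool]


def cascade_stance(
    claim: str,
    document: str,
    is_related: Predicate,
    takes_stance: Predicate,
    agrees: Predicate,
) -> str:
    """Resolve a 4-class stance label through three binary decisions."""
    if not is_related(claim, document):
        return "unrelated"
    if not takes_stance(claim, document):
        return "discuss"
    return "agree" if agrees(claim, document) else "disagree"


# Hypothetical stand-ins for trained models, based on keyword matching only.
def toy_is_related(claim: str, document: str) -> bool:
    return bool(set(claim.lower().split()) & set(document.lower().split()))


def toy_takes_stance(claim: str, document: str) -> bool:
    return any(cue in document.lower() for cue in ("confirmed", "false", "hoax"))


def toy_agrees(claim: str, document: str) -> bool:
    return not any(cue in document.lower() for cue in ("false", "hoax"))


def classify(claim: str, document: str) -> str:
    """Run the cascade with the toy predicates above."""
    return cascade_stance(claim, document, toy_is_related, toy_takes_stance, toy_agrees)
```

Because each step is a separate binary decision, the threshold or misclassification cost of a single step (for example the final agree/disagree step) can be adjusted without retraining the others.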

Same but Different: Distant Supervision for Predicting and Understanding Entity Linking Difficulty

Dec 13, 2018

Renato Stoffalette João, Pavlos Fafalios, Stefan Dietze

Abstract: Entity Linking (EL) is the task of automatically identifying entity mentions in a piece of text and resolving them to a corresponding entity in a reference knowledge base like Wikipedia. There is a large number of EL tools available for different types of documents and domains, yet EL remains a challenging task where the lack of precision on particularly ambiguous mentions often spoils the usefulness of automated disambiguation results in real applications. A priori approximations of the difficulty of linking a particular entity mention can facilitate flagging of critical cases as part of semi-automated EL systems, while detecting latent factors that affect the EL performance, like corpus-specific features, can provide insights on how to improve a system based on the special characteristics of the underlying corpus. In this paper, we first introduce a consensus-based method to generate difficulty labels for entity mentions on arbitrary corpora. The difficulty labels are then exploited as training data for a supervised classification task able to predict the EL difficulty of entity mentions using a variety of features. Experiments over a corpus of news articles show that EL difficulty can be estimated with high accuracy, revealing also latent features that affect EL performance. Finally, evaluation results demonstrate the effectiveness of the proposed method to inform semi-automated EL pipelines.

* Preprint of paper accepted for publication in the 34th ACM/SIGAPP Symposium On Applied Computing (SAC 2019)
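
One way to picture the consensus-based labeling mentioned in the abstract above is to compare the links that several EL systems return for the same mention. The sketch below is an assumption-laden illustration, not the method from the paper: the three label names (`easy`, `medium`, `hard`), the majority threshold, and the function name `difficulty_label` are all hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence


def difficulty_label(links: Sequence[Optional[str]]) -> str:
    """Derive a difficulty label for one mention from system agreement.

    links holds one predicted entity identifier per EL system,
    or None where a system produced no link.
    """
    votes = Counter(link for link in links if link is not None)
    if not votes:
        # No system linked the mention at all.
        return "hard"
    top_votes = votes.most_common(1)[0][1]
    if top_votes == len(links):
        # Every system returned the same entity.
        return "easy"
    if top_votes > len(links) / 2:
        # A strict majority agrees, but not all systems.
        return "medium"
    return "hard"
```

Labels produced this way need no manual annotation, so they could serve as distantly supervised training data for a classifier that predicts difficulty from mention features.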

Time-Aware and Corpus-Specific Entity Relatedness

Oct 23, 2018

Nilamadhaba Mohapatra, Vasileios Iosifidis, Asif Ekbal, Stefan Dietze, Pavlos Fafalios

Abstract: Entity relatedness has emerged as an important feature in a plethora of applications such as information retrieval, entity recommendation and entity linking. Given an entity, for instance a person or an organization, entity relatedness measures can be exploited for generating a list of highly related entities. However, the relation of an entity to some other entity depends on several factors, with time and context being two of the most important ones (where, in our case, context is determined by a particular corpus). For example, the entities related to the International Monetary Fund are different now compared to some years ago, while these entities may also differ considerably in the context of a USA news portal compared to a Greek news portal. In this paper, we propose a simple but flexible model for entity relatedness which considers time and entity aware word embeddings by exploiting the underlying corpus. The proposed model does not require external knowledge and is language-independent, which makes it widely useful in a variety of applications.
