Benno Stein

Argumentation in Waltz's "Emerging Structure of International Politics"

Dec 31, 2023

Magdalena Wolska, Bernd Fröhlich, Katrin Girgensohn, Sassan Gholiagha, Dora Kiesel, Jürgen Neyer, Patrick Riehmann, Mitja Sienknecht, Benno Stein

Abstract: We present an annotation scheme for argumentative and domain-specific aspects of scholarly articles on the theory of International Relations. At the argumentation level, we identify Claims and Support/Attack relations. At the domain level, we model discourse content in terms of Theory- and Data-related statements. We annotate Waltz's 1993 text on structural realism and show that our scheme can be reliably applied by domain experts and enables insights into two research questions on the justification of claims.

* 9 pages

Evaluating Generative Ad Hoc Information Retrieval

Nov 08, 2023

Lukas Gienapp, Harrisen Scells, Niklas Deckers, Janek Bevendorff, Shuai Wang, Johannes Kiesel, Shahbaz Syed, Maik Fröbe, Guido Zuccon, Benno Stein (+2 more)

Abstract: Recent advances in large language models have enabled the development of viable generative information retrieval systems. A generative retrieval system returns a grounded generated text in response to an information need instead of the traditional document ranking. Quantifying the utility of these types of responses is essential for evaluating generative retrieval systems. As the established evaluation methodology for ranking-based ad hoc retrieval may seem unsuitable for generative retrieval, new approaches for reliable, repeatable, and reproducible experimentation are required. In this paper, we survey the relevant information retrieval and natural language processing literature, identify search tasks and system architectures in generative retrieval, develop a corresponding user model, and study its operationalization. This theoretical analysis provides a foundation and new insights for the evaluation of generative ad hoc retrieval systems.

* 14 pages, 5 figures, 1 table

Drawing the Same Bounding Box Twice? Coping Noisy Annotations in Object Detection with Repeated Labels

Sep 18, 2023

David Tschirschwitz, Christian Benz, Morris Florek, Henrik Norderhus, Benno Stein, Volker Rodehorst

Abstract: The reliability of supervised machine learning systems depends on the accuracy and availability of ground truth labels. However, the process of human annotation, being prone to error, introduces the potential for noisy labels, which can impede the practicality of these systems. While training with noisy labels is a significant consideration, the reliability of test data is also crucial to ascertain the dependability of the results. A common approach to addressing this issue is repeated labeling, where multiple annotators label the same example, and their labels are combined to provide a better estimate of the true label. In this paper, we propose a novel localization algorithm that adapts well-established ground truth estimation methods for object detection and instance segmentation tasks. The key innovation of our method lies in its ability to transform combined localization and classification tasks into classification-only problems, thus enabling the application of techniques such as Expectation-Maximization (EM) or Majority Voting (MJV). Although our main focus is the aggregation of unique ground truth for test data, our algorithm also shows superior performance during training on the TexBiG dataset, surpassing both noisy label training and label aggregation using Weighted Boxes Fusion (WBF). Our experiments indicate that the benefits of repeated labels emerge under specific dataset and annotation configurations. The key factors appear to be (1) dataset complexity, (2) annotator consistency, and (3) the given annotation budget constraints.
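
The classification-only aggregation step mentioned in the abstract can be illustrated with plain majority voting. The sketch below is not the paper's localization algorithm; it assumes the repeated annotations have already been matched to the same object, and the item ids and class names are hypothetical.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label; ties go to the label seen first."""
    counts = Counter(labels)
    return counts.most_common(1)[0][0]


def aggregate(annotations):
    """Map each item id to the majority label over its repeated annotations."""
    return {item: majority_vote(labels) for item, labels in annotations.items()}


# Hypothetical repeated labels: three annotators per (already matched) object.
annotations = {
    "box_1": ["table", "table", "figure"],
    "box_2": ["figure", "figure", "figure"],
    "box_3": ["text", "table", "text"],
}

aggregated = aggregate(annotations)
print(aggregated)
```

Majority voting treats all annotators as equally reliable; methods such as EM instead estimate per-annotator reliability, which is why the abstract names both.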

The Information Retrieval Experiment Platform

May 30, 2023

Maik Fröbe, Jan Heinrich Reimer, Sean MacAvaney, Niklas Deckers, Simon Reich, Janek Bevendorff, Benno Stein, Matthias Hagen, Martin Potthast

Abstract: We integrate ir_datasets, ir_measures, and PyTerrier with TIRA in the Information Retrieval Experiment Platform (TIREx) to promote more standardized, reproducible, scalable, and even blinded retrieval experiments. Standardization is achieved when a retrieval approach implements PyTerrier's interfaces and the input and output of an experiment are compatible with ir_datasets and ir_measures. However, none of this is a must for reproducibility and scalability, as TIRA can run any dockerized software locally or remotely in a cloud-native execution environment. Version control and caching ensure efficient (re)execution. TIRA allows for blind evaluation when an experiment runs on a remote server or cloud not under the control of the experimenter. The test data and ground truth are then hidden from public access, and the retrieval software has to process them in a sandbox that prevents data leaks. We currently host an instance of TIREx with 15 corpora (1.9 billion documents) on which 32 shared retrieval tasks are based. Using Docker images of 50 standard retrieval approaches, we automatically evaluated all approaches on all tasks (50 × 32 = 1,600 runs) in less than a week on a midsize cluster (1,620 CPU cores and 24 GPUs). This instance of TIREx is open for submissions and will be integrated with the IR Anthology, as well as released open source.

* 11 pages. To be published in the proceedings of SIGIR 2023

Perspectives on Large Language Models for Relevance Judgment

Apr 13, 2023

Guglielmo Faggioli, Laura Dietz, Charles Clarke, Gianluca Demartini, Matthias Hagen, Claudia Hauff, Noriko Kando, Evangelos Kanoulas, Martin Potthast, Benno Stein(+1 more)

Abstract: When asked, current large language models (LLMs) like ChatGPT claim that they can assist us with relevance judgments. Many researchers think this would not lead to credible IR research. In this perspective paper, we discuss possible ways for LLMs to assist human experts along with concerns and issues that arise. We devise a human-machine collaboration spectrum that allows categorizing different relevance judgment strategies, based on how much the human relies on the machine. For the extreme point of "fully automated assessment", we further include a pilot experiment on whether LLM-based relevance judgments correlate with judgments from trained human assessors. We conclude the paper by providing two opposing perspectives - for and against the use of LLMs for automatic relevance judgments - and a compromise perspective, informed by our analyses of the literature, our preliminary experimental evidence, and our experience as IR researchers. We hope to start a constructive discussion within the community to avoid a stalemate during review, where work is damned if it uses LLMs for evaluation and damned if it doesn't.
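
One way to quantify how closely LLM-based judgments track human judgments is a chance-corrected agreement statistic such as Cohen's kappa. The abstract does not say which measure the pilot experiment uses, so the choice of kappa here is an assumption, and the judgment lists are made-up illustrations rather than data from the paper.

```python
from collections import Counter


def cohens_kappa(a, b):
    """Cohen's kappa for two equally long lists of categorical judgments."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty lists of equal length")
    n = len(a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement under independence, from each rater's label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[l] * count_b[l] for l in set(a) | set(b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical binary relevance judgments (1 = relevant, 0 = not relevant).
human = [1, 0, 1, 1, 0, 0, 1, 0]
llm = [1, 0, 1, 0, 0, 0, 1, 1]

kappa = cohens_kappa(human, llm)
print(f"kappa = {kappa:.2f}")
```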

The Archive Query Log: Mining Millions of Search Result Pages of Hundreds of Search Engines from 25 Years of Web Archives

Apr 02, 2023

Jan Heinrich Reimer, Sebastian Schmidt, Maik Fröbe, Lukas Gienapp, Harrisen Scells, Benno Stein, Matthias Hagen, Martin Potthast

Abstract: The Archive Query Log (AQL) is a previously unused, comprehensive query log collected at the Internet Archive over the last 25 years. Its first version includes 356 million queries, 166 million search result pages, and 1.7 billion search results across 550 search providers. Although many query logs have been studied in the literature, the search providers that own them generally do not publish their logs to protect user privacy and vital business data. Of the few query logs publicly available, none combines size, scope, and diversity. The AQL is the first to do so, enabling research on new retrieval models and (diachronic) search engine analyses. Provided in a privacy-preserving manner, it promotes open research as well as more transparency and accountability in the search industry.

* 12 pages. To be published in the proceedings of SIGIR 2023

The Touché23-ValueEval Dataset for Identifying Human Values behind Arguments

Jan 31, 2023

Nailia Mirzakhmedova, Johannes Kiesel, Milad Alshomary, Maximilian Heinrich, Nicolas Handke, Xiaoni Cai, Barriere Valentin, Doratossadat Dastgheib, Omid Ghahroodi, Mohammad Ali Sadraei(+4 more)

Abstract: We present the Touché23-ValueEval Dataset for Identifying Human Values behind Arguments. To investigate approaches for the automated detection of human values behind arguments, we collected 9324 arguments from 6 diverse sources, covering religious texts, political discussions, free-text arguments, newspaper editorials, and online democracy platforms. Each argument was annotated by 3 crowdworkers for 54 values. The Touché23-ValueEval dataset extends the Webis-ArgValues-22. In comparison to the previous dataset, the effectiveness of a 1-Baseline decreases, but that of an out-of-the-box BERT model increases. Therefore, though the classification difficulty increased as per the label distribution, the larger dataset allows for training better models.

Paraphrase Acquisition from Image Captions

Jan 26, 2023

Marcel Gohsen, Matthias Hagen, Martin Potthast, Benno Stein

Abstract: We propose to use captions from the Web as a previously underutilized resource for paraphrases (i.e., texts with the same "message") and to create and analyze a corresponding dataset. When an image is reused on the Web, an original caption is often assigned. We hypothesize that different captions for the same image naturally form a set of mutual paraphrases. To demonstrate the suitability of this idea, we analyze captions in the English Wikipedia, where editors frequently relabel the same image for different articles. The paper introduces the underlying mining technology and compares known paraphrase corpora with respect to their syntactic and semantic paraphrase similarity to our new resource. In this context, we introduce characteristic maps along the two similarity dimensions to identify the style of paraphrases coming from different sources. An annotation study demonstrates the high reliability of the algorithmically determined characteristic maps.
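
The hypothesis that different captions for the same image form a set of mutual paraphrases suggests a simple grouping step. The following is a minimal sketch of that idea only, not the paper's mining technology; the record format, image ids, and captions are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def paraphrase_candidates(records):
    """Pair up distinct captions that were assigned to the same image."""
    by_image = defaultdict(list)
    for image_id, caption in records:
        caption = caption.strip()
        # Skip empty captions and verbatim repeats for the same image.
        if caption and caption not in by_image[image_id]:
            by_image[image_id].append(caption)
    pairs = []
    for captions in by_image.values():
        # Every unordered pair of captions for one image is a candidate.
        pairs.extend(combinations(captions, 2))
    return pairs


# Hypothetical (image id, caption) records.
records = [
    ("img_a", "The Eiffel Tower at night"),
    ("img_a", "Night view of the Eiffel Tower"),
    ("img_a", "The Eiffel Tower at night"),  # verbatim duplicate, ignored
    ("img_b", "A red fox in snow"),
    ("img_a", "Eiffel Tower illuminated after dark"),
]

pairs = paraphrase_candidates(records)
for first, second in pairs:
    print(f"{first!r} <-> {second!r}")
```

An image with n distinct captions yields n(n-1)/2 candidate pairs, so frequently reused images contribute disproportionately; a real pipeline would likely filter or sample these candidates.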

Topic Ontologies for Arguments

Jan 23, 2023

Yamen Ajjour, Johannes Kiesel, Benno Stein, Martin Potthast

Abstract: Many computational argumentation tasks, like stance classification, are topic-dependent: the effectiveness of approaches to these tasks significantly depends on whether the approaches were trained on arguments from the same topics as those they are tested on. So, which are these topics that researchers train approaches on? This paper contributes the first comprehensive survey of topic coverage, assessing 45 argument corpora. For the assessment, we take the first step towards building an argument topic ontology, consulting three diverse authoritative sources: the World Economic Forum, the Wikipedia list of controversial topics, and Debatepedia. Comparing the topic sets between the authoritative sources and corpora, our analysis shows that the corpora topics, which are mostly those frequently discussed in public online fora, are covered well by the sources. However, other topics from the sources are less extensively covered by the corpora of today, revealing interesting future directions for corpus construction.

The Infinite Index: Information Retrieval on Generative Text-To-Image Models

Dec 14, 2022

Niklas Deckers, Maik Fröbe, Johannes Kiesel, Gianluca Pandolfo, Christopher Schröder, Benno Stein, Martin Potthast

Abstract: The text-to-image model Stable Diffusion has recently become very popular. Only weeks after its open source release, millions are experimenting with image generation. This is due to its ease of use, since all it takes is a brief description of the desired image to "prompt" the generative model. Rarely do the images generated for a new prompt immediately meet the user's expectations. Usually, an iterative refinement of the prompt ("prompt engineering") is necessary for satisfying images. As a new perspective, we recast image prompt engineering as interactive image retrieval - on an "infinite index". Thereby, a prompt corresponds to a query and prompt engineering to query refinement. Selected image-prompt pairs allow direct relevance feedback, as the model can modify an image for the refined prompt. This is a form of one-sided interactive retrieval, where the initiative is on the user side, whereas the server side remains stateless. In light of an extensive literature review, we develop these parallels in detail and apply the findings to a case study of a creative search task on such a model. We note that the uncertainty in searching an infinite index is virtually never-ending. We also discuss future research opportunities related to retrieval models specialized for generative models and interactive generative image retrieval. The application of IR technology, such as query reformulation and relevance feedback, will contribute to improved workflows when using generative models, while the notion of an infinite index raises new challenges in IR research.

* Accepted at CHIIR 2023
