Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Benjamin Piwowarski

The GDN-CC Dataset: Automatic Corpus Clarification for AI-enhanced Democratic Citizen Consultations

Jan 21, 2026

Pierre-Antoine Lequeu, Léo Labat, Laurène Cave, Gaël Lejeune, François Yvon, Benjamin Piwowarski

Abstract:LLMs are ubiquitous in modern NLP, and while their applicability extends to texts produced for democratic activities such as online deliberations or large-scale citizen consultations, ethical questions have been raised for their usage as analysis tools. We continue this line of research with two main goals: (a) to develop resources that can help standardize citizen contributions in public forums at the pragmatic level, and make them easier to use in topic modeling and political analysis; (b) to study how well this standardization can reliably be performed by small, open-weights LLMs, i.e. models that can be run locally and transparently with limited resources. Accordingly, we introduce Corpus Clarification as a preprocessing framework for large-scale consultation data that transforms noisy, multi-topic contributions into structured, self-contained argumentative units ready for downstream analysis. We present GDN-CC, a manually-curated dataset of 1,231 contributions to the French Grand Débat National, comprising 2,285 argumentative units annotated for argumentative structure and manually clarified. We then show that finetuned Small Language Models match or outperform LLMs on reproducing these annotations, and measure their usability for an opinion clustering task. We finally release GDN-CC-large, an automatically annotated corpus of 240k contributions, the largest annotated democratic consultation dataset to date.

* 31 pages including 22 for references and appendix, 13 figures

Via

Access Paper or Ask Questions

QUESTER: Query Specification for Generative Retrieval

Nov 07, 2025

Arthur Satouf, Yuxuan Zong, Habiboulaye Amadou-Boubacar, Pablo Piantanida, Benjamin Piwowarski

Abstract:Generative Retrieval (GR) differs from the traditional index-then-retrieve pipeline by storing relevance in model parameters and directly generating document identifiers. However, GR often struggles to generalize and is costly to scale. We introduce QUESTER (QUEry SpecificaTion gEnerative Retrieval), which reframes GR as query specification generation - in this work, a simple keyword query handled by BM25 - using a (small) LLM. The policy is trained using reinforcement learning techniques (GRPO). Across in- and out-of-domain evaluations, we show that our model is more effective than BM25, and competitive with neural IR models, while maintaining a good efficiency

Via

Access Paper or Ask Questions

On the Representations of Entities in Auto-regressive Large Language Models

Oct 10, 2025

Victor Morand, Josiane Mothe, Benjamin Piwowarski

Abstract:Named entities are fundamental building blocks of knowledge in text, grounding factual information and structuring relationships within language. Despite their importance, it remains unclear how Large Language Models (LLMs) internally represent entities. Prior research has primarily examined explicit relationships, but little is known about entity representations themselves. We introduce entity mention reconstruction as a novel framework for studying how LLMs encode and manipulate entities. We investigate whether entity mentions can be generated from internal representations, how multi-token entities are encoded beyond last-token embeddings, and whether these representations capture relational knowledge. Our proposed method, leveraging _task vectors_, allows to consistently generate multi-token mentions from various entity representations derived from the LLMs hidden states. We thus introduce the _Entity Lens_, extending the _logit-lens_ to predict multi-token mentions. Our results bring new evidence that LLMs develop entity-specific mechanisms to represent and manipulate any multi-token entities, including those unseen during training. Our code is avalable at https://github.com/VictorMorand/EntityRepresentations .

* Accepted at BlackBoxNLP@EMNLP2025

Via

Access Paper or Ask Questions

Rational Retrieval Acts: Leveraging Pragmatic Reasoning to Improve Sparse Retrieval

May 06, 2025

Arthur Satouf, Gabriel Ben Zenou, Benjamin Piwowarski, Habiboulaye Amadou Boubacar, Pablo Piantanida

Abstract:Current sparse neural information retrieval (IR) methods, and to a lesser extent more traditional models such as BM25, do not take into account the document collection and the complex interplay between different term weights when representing a single document. In this paper, we show how the Rational Speech Acts (RSA), a linguistics framework used to minimize the number of features to be communicated when identifying an object in a set, can be adapted to the IR case -- and in particular to the high number of potential features (here, tokens). RSA dynamically modulates token-document interactions by considering the influence of other documents in the dataset, better contrasting document representations. Experiments show that incorporating RSA consistently improves multiple sparse retrieval models and achieves state-of-the-art performance on out-of-domain datasets from the BEIR benchmark. https://github.com/arthur-75/Rational-Retrieval-Acts

* 6 pages - 2 figures - conference: accepted at SIGIR 2025

Via

Access Paper or Ask Questions

Towards Lossless Token Pruning in Late-Interaction Retrieval Models

Apr 17, 2025

Yuxuan Zong, Benjamin Piwowarski

Figure 1 for Towards Lossless Token Pruning in Late-Interaction Retrieval Models

Figure 2 for Towards Lossless Token Pruning in Late-Interaction Retrieval Models

Figure 3 for Towards Lossless Token Pruning in Late-Interaction Retrieval Models

Figure 4 for Towards Lossless Token Pruning in Late-Interaction Retrieval Models

Abstract:Late interaction neural IR models like ColBERT offer a competitive effectiveness-efficiency trade-off across many benchmarks. However, they require a huge memory space to store the contextual representation for all the document tokens. Some works have proposed using either heuristics or statistical-based techniques to prune tokens from each document. This however doesn't guarantee that the removed tokens have no impact on the retrieval score. Our work uses a principled approach to define how to prune tokens without impacting the score between a document and a query. We introduce three regularization losses, that induce a solution with high pruning ratios, as well as two pruning strategies. We study them experimentally (in and out-domain), showing that we can preserve ColBERT's performance while using only 30\% of the tokens.

* Accepted at SIGIR 2025 Full Paper Track

Via

Access Paper or Ask Questions

Structural Deep Encoding for Table Question Answering

Mar 03, 2025

Raphaël Mouravieff, Benjamin Piwowarski, Sylvain Lamprier

Abstract:Although Transformers-based architectures excel at processing textual information, their naive adaptation for tabular data often involves flattening the table structure. This simplification can lead to the loss of essential inter-dependencies between rows, columns, and cells, while also posing scalability challenges for large tables. To address these issues, prior works have explored special tokens, structured embeddings, and sparse attention patterns. In this paper, we conduct a comprehensive analysis of tabular encoding techniques, which highlights the crucial role of attention sparsity in preserving structural information of tables. We also introduce a set of novel sparse attention mask designs for tabular data, that not only enhance computational efficiency but also preserve structural integrity, leading to better overall performance.

* 8 pages, 4 figures

Via

Access Paper or Ask Questions

Survey on Question Answering over Visually Rich Documents: Methods, Challenges, and Trends

Jan 04, 2025

Camille Barboule, Benjamin Piwowarski, Yoan Chabot

Abstract:Using Large Language Models (LLMs) for Visually-rich Document Understanding (VrDU) has significantly improved performance on tasks requiring both comprehension and generation, such as question answering, albeit introducing new challenges. This survey explains how VrDU models enhanced by LLMs function, covering methods for integrating VrD features into LLMs and highlighting key challenges.

Via

Access Paper or Ask Questions

Probing Language Models on Their Knowledge Source

Oct 08, 2024

Zineddine Tighidet, Andrea Mogini, Jiali Mei, Benjamin Piwowarski, Patrick Gallinari

Abstract:Large Language Models (LLMs) often encounter conflicts between their learned, internal (parametric knowledge, PK) and external knowledge provided during inference (contextual knowledge, CK). Understanding how LLMs models prioritize one knowledge source over the other remains a challenge. In this paper, we propose a novel probing framework to explore the mechanisms governing the selection between PK and CK in LLMs. Using controlled prompts designed to contradict the model's PK, we demonstrate that specific model activations are indicative of the knowledge source employed. We evaluate this framework on various LLMs of different sizes and demonstrate that mid-layer activations, particularly those related to relations in the input, are crucial in predicting knowledge source selection, paving the way for more reliable models capable of handling knowledge conflicts effectively.

* Accepted at BlackBoxNLP@EMNLP2024

Via

Access Paper or Ask Questions

Which Neurons Matter in IR? Applying Integrated Gradients-based Methods to Understand Cross-Encoders

Jun 27, 2024

Mathias Vast, Basile Van Cooten, Laure Soulier, Benjamin Piwowarski

Abstract:With the recent addition of Retrieval-Augmented Generation (RAG), the scope and importance of Information Retrieval (IR) has expanded. As a result, the importance of a deeper understanding of IR models also increases. However, interpretability in IR remains under-explored, especially when it comes to the models' inner mechanisms. In this paper, we explore the possibility of adapting Integrated Gradient-based methods in an IR context to identify the role of individual neurons within the model. In particular, we provide new insights into the role of what we call "relevance" neurons, as well as how they deal with unseen data. Finally, we carry out an in-depth pruning study to validate our findings.

* Accepted at ICTIR 2024

Via

Access Paper or Ask Questions

What Makes Multimodal In-Context Learning Work?

Apr 25, 2024

Folco Bertini Baldassini, Mustafa Shukor, Matthieu Cord, Laure Soulier, Benjamin Piwowarski

Figure 1 for What Makes Multimodal In-Context Learning Work?

Figure 2 for What Makes Multimodal In-Context Learning Work?

Figure 3 for What Makes Multimodal In-Context Learning Work?

Figure 4 for What Makes Multimodal In-Context Learning Work?

Abstract:Large Language Models have demonstrated remarkable performance across various tasks, exhibiting the capacity to swiftly acquire new skills, such as through In-Context Learning (ICL) with minimal demonstration examples. In this work, we present a comprehensive framework for investigating Multimodal ICL (M-ICL) in the context of Large Multimodal Models. We consider the best open-source multimodal models (e.g., IDEFICS, OpenFlamingo) and a wide range of multimodal tasks. Our study unveils several noteworthy findings: (1) M-ICL primarily relies on text-driven mechanisms, showing little to no influence from the image modality. (2) When used with advanced-ICL strategy (like RICES), M-ICL is not better than a simple strategy based on majority voting over context examples. Moreover, we identify several biases and limitations of M-ICL that warrant consideration prior to deployment. Code available at https://gitlab.com/folbaeni/multimodal-icl

* 20 pages, 16 figures. Accepted to CVPR 2024 Workshop on Prompting in Vision. Project page: https://folbaeni.gitlab.io/multimodal-icl

Via

Access Paper or Ask Questions