We present a framework for improving the performance of a wide class of retrieval models at minimal computational cost. It utilizes precomputed document representations extracted by a base dense retrieval method and involves training a model to jointly score a large set of retrieved candidate documents for each query, while potentially transforming each document's representation on the fly in the context of the other candidates as well as the query itself. When scoring a document representation based on its similarity to a query, the model is thus aware of the representations of its "peer" documents. We show that our approach leads to substantial improvements in retrieval performance over the base method and over scoring candidate documents in isolation from one another, as in a pairwise training setting. Crucially, unlike term-interaction rerankers based on BERT-like encoders, it incurs negligible computational overhead on top of any first-stage method at run time, allowing it to be easily combined with any state-of-the-art dense retrieval method. Finally, jointly considering a set of candidate documents for a given query enables additional valuable capabilities in retrieval, such as score calibration and mitigating societal biases in ranking.
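A minimal sketch of the peer-aware scoring idea described above: each candidate's precomputed embedding is mixed with a context vector summarizing its peers before being scored against the query. The mean-pooling mix and the `alpha` weight are illustrative assumptions, not the paper's learned transformation (which would be a trained model, e.g. a transformer over the candidate set).

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def joint_scores(query, docs, alpha=0.5):
    """Score each candidate after mixing in a peer-context vector.

    Each document embedding is shifted toward the centroid of the
    candidate set, so the final query-document similarity is aware
    of the "peer" documents (illustrative stand-in for a learned
    set-level transformation).
    """
    dim = len(query)
    n = len(docs)
    centroid = [sum(d[i] for d in docs) / n for i in range(dim)]
    scores = []
    for d in docs:
        ctx = [(1 - alpha) * d[i] + alpha * centroid[i] for i in range(dim)]
        scores.append(dot(query, ctx))
    return scores
```

Because only precomputed embeddings are combined at run time, the overhead over isolated scoring is negligible, which is the efficiency argument the abstract makes.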
Research into deep learning models for molecular property prediction has primarily focused on the development of better Graph Neural Network (GNN) architectures. Though new GNN variants continue to improve performance, their modifications share a common theme of alleviating problems intrinsic to their fundamental graph-to-graph nature. In this work, we examine these limitations and propose a new molecular representation, dubbed CubeMol, that bypasses the need for GNNs entirely. Our fixed-dimensional stochastic representation, when paired with a transformer model, exceeds the performance of state-of-the-art GNN models and provides a path toward scalability.
We present a novel corpus of 445 human- and computer-generated documents, comprising about 27,000 clauses, annotated for semantic clause types and coherence relations that allow for nuanced comparison of artificial and natural discourse modes. The corpus covers both formal and informal discourse, and contains documents generated using fine-tuned GPT-2 (Zellers et al., 2019) and GPT-3 (Brown et al., 2020). We showcase the usefulness of this corpus for detailed discourse analysis of text generation by providing preliminary evidence that clause relations that are fewer, shorter, and more often incoherent are associated with lower perceived quality of computer-generated narratives and arguments.
The recent success of distributed word representations has led to an increased interest in analyzing the properties of their spatial distribution. Current metrics suggest that contextualized word embedding models do not uniformly utilize all dimensions when embedding tokens in vector space. Here we argue that existing metrics are fragile and tend to obfuscate the true spatial distribution of point clouds. To ameliorate this issue, we propose IsoScore: a novel metric which quantifies the degree to which a point cloud uniformly utilizes the ambient vector space. We demonstrate that IsoScore has several desirable properties, such as mean invariance and direct correspondence to the number of dimensions used, which existing scores do not possess. Furthermore, IsoScore is conceptually intuitive and computationally efficient, making it well suited for analyzing the distribution of point clouds in arbitrary vector spaces, not limited to word embeddings alone. Additionally, we use IsoScore to demonstrate that a number of recent conclusions in the NLP literature that were derived using brittle metrics of spatial distribution, such as average cosine similarity, may be incomplete or altogether inaccurate.
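To make the notion of "uniformly utilizing the ambient space" concrete, here is a simplified isotropy sketch in the spirit of IsoScore: it measures how evenly the point cloud's variance is spread across coordinate axes, returning 1 for a perfectly even spread and 0 when all variance concentrates in one axis. This is an illustration of the idea only; the published IsoScore first reorients the cloud with PCA and uses a different normalization.

```python
def isotropy_sketch(points):
    """Rough isotropy score in [0, 1]: 1 if variance is spread evenly
    across all coordinate axes, 0 if it concentrates in a single axis.

    Simplified illustration of the intuition behind IsoScore; the
    published metric applies PCA first and normalizes differently.
    """
    dim = len(points[0])
    n = len(points)
    means = [sum(p[i] for p in points) / n for i in range(dim)]
    var = [sum((p[i] - means[i]) ** 2 for p in points) / n for i in range(dim)]
    total = sum(var)
    if total == 0:
        return 1.0  # degenerate cloud: treat as trivially isotropic
    frac = [v / total for v in var]
    uniform = 1.0 / dim
    # L1 distance from the uniform variance allocation, rescaled to [0, 1]
    defect = sum(abs(f - uniform) for f in frac) / (2 * (1 - uniform))
    return 1.0 - defect
```

Note that, unlike average cosine similarity, a variance-allocation view like this is invariant to shifting the whole cloud by a constant mean vector, which is the "mean invariance" property the abstract highlights.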
Numerous neural retrieval models have been proposed in recent years. These models learn to compute a ranking score between a given query and document. The majority of existing models are trained in a pairwise fashion using human-judged labels directly, without further calibration. Traditional pairwise schemes can be time-consuming and require pre-defined positive-negative document pairs for training, potentially leading to learning bias due to document distribution mismatch between training and test conditions. Some popular existing listwise schemes rely on strong pre-defined probabilistic assumptions and a stark difference between relevant and non-relevant documents for a given query, which may limit the model's potential when relevance labels are low-quality or ambiguous. To address these concerns, we turn to a physics-inspired ranking balance scheme and propose PoolRank, a pooling-based listwise learning framework. The proposed scheme has four major advantages: (1) PoolRank extracts training information from the best candidates at the local level based on model performance and relative ranking among abundant document candidates. (2) By combining four pooling-based loss components in a multi-task learning fashion, PoolRank calibrates the ranking balance for partially relevant and highly non-relevant documents automatically, without costly human inspection. (3) PoolRank can be easily generalized to any neural retrieval model without requiring additional learnable parameters or model structure modifications. (4) Compared to pairwise learning and existing listwise learning schemes, PoolRank yields better ranking performance for all studied retrieval models while retaining efficient convergence rates.
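The core pooling step behind a scheme like this can be sketched in a few lines: partition a ranked candidate list into local windows and keep only the locally best score in each, so the loss focuses on the strongest candidate at each local level. The window size and max-pooling choice here are illustrative assumptions; PoolRank combines several pooled views in a multi-task loss.

```python
def max_pool(scores, window):
    """Coarse-grain a candidate score list: keep the locally best score
    in each fixed-size window (illustrative pooling step; a pooling-based
    listwise loss would then be computed over these pooled views).
    """
    return [max(scores[i:i + window]) for i in range(0, len(scores), window)]
```

Pooling of this kind surfaces locally prominent candidates that a loss over the raw, full-length list might otherwise drown out.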
The goal of information retrieval is to recommend a list of document candidates that are most relevant to a given query. Listwise learning trains neural retrieval models by comparing many candidates simultaneously, offering much more competitive performance than pairwise and pointwise schemes. However, existing listwise ranking losses treat the candidate document list as a whole without further inspection. Candidates with moderate semantic prominence may be lost in noisy similarity signals or overshadowed by a few especially pronounced candidates. As a result, existing ranking losses fail to exploit the full potential of neural retrieval models. To address these concerns, we apply the classic pooling technique to conduct multi-level coarse graining and propose ExpertRank, a novel expert-based listwise ranking loss. The proposed scheme has three major advantages: (1) ExpertRank introduces the physics concept of coarse graining to information retrieval by selecting prominent candidates at various local levels based on model prediction and inter-document comparison. (2) ExpertRank applies the mixture of experts (MoE) technique to combine different experts effectively by extending the traditional ListNet. (3) Compared to other existing listwise learning approaches, ExpertRank produces much more reliable and competitive performance for neural retrieval models of varying complexity, from traditional models such as KNRM, ConvKNRM, and MatchPyramid to sophisticated BERT/ALBERT-based retrieval models.
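Since ExpertRank is described as extending ListNet, a minimal sketch of the classic top-one ListNet loss is useful context: it is the cross-entropy between the softmax distributions induced by the predicted scores and the ground-truth relevance scores over the candidate list. This is standard ListNet, not the ExpertRank extension itself.

```python
import math

def softmax(xs):
    m = max(xs)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def listnet_loss(pred_scores, true_scores):
    """Top-one ListNet loss: cross-entropy between the softmax
    distributions induced by predicted and ground-truth scores.
    ExpertRank builds on this by combining several pooled "expert"
    views of the candidate list via a mixture of experts.
    """
    p = softmax(true_scores)
    q = softmax(pred_scores)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))
```

The loss is minimized when the predicted score distribution matches the label-induced distribution, so ranking the true best candidate on top is rewarded across the whole list at once.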
Despite advances in neural machine translation, cross-lingual retrieval tasks in which queries and documents live in different natural language spaces remain challenging. Although neural translation models may provide an intuitive approach to the cross-lingual problem, their resource-intensive training and complex model structures can complicate the overall retrieval pipeline and reduce user engagement. In this paper, we build an end-to-end Cross-Lingual Arabic Information REtrieval (CLAIRE) system based on cross-lingual word embeddings, where searchers are assumed to have a passable passive understanding of Arabic and supporting information in English is provided to aid the retrieval experience. The proposed system has three major advantages: (1) The use of English-Arabic word embeddings simplifies the overall pipeline and avoids the potential mistakes introduced by machine translation. (2) Our CLAIRE system can incorporate arbitrary word embedding-based neural retrieval models without structural modification. (3) Early empirical results on an Arabic news collection show promising performance.
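The translation-free matching step can be sketched as follows: English query terms and Arabic document terms are embedded in a shared bilingual space, and relevance is scored by best-match cosine similarity, so no machine translation step is needed. The two-dimensional vectors below are hypothetical placeholders for real bilingual embeddings.

```python
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def clir_score(query_vecs, doc_vecs):
    """Cross-lingual matching sketch: for each (English) query-term
    vector, take its best cosine match among the (Arabic) document-term
    vectors in the shared bilingual embedding space, and average.
    """
    return sum(max(cosine(q, d) for d in doc_vecs)
               for q in query_vecs) / len(query_vecs)
```

Because scoring happens directly in the shared space, any embedding-based neural retrieval model can consume these vectors unchanged, which is the system's second stated advantage.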
Existing neural ranking models follow the text matching paradigm, where document-to-query relevance is estimated by predicting a matching score. Drawing from the rich literature of classical generative retrieval models, we introduce and formalize the paradigm of deep generative retrieval models, defined via the cumulative probabilities of generating query terms. This paradigm offers a grounded probabilistic view on relevance estimation while still enabling the use of modern neural architectures. In contrast to the matching paradigm, the probabilistic nature of generative rankers readily offers a fine-grained measure of uncertainty. We adopt several current neural generative models in our framework and introduce a novel generative ranker (T-PGN), which combines the encoding capacity of Transformers with the Pointer Generator Network model. We conduct an extensive set of evaluation experiments on passage retrieval, leveraging the MS MARCO Passage Re-ranking and TREC Deep Learning 2019 Passage Re-ranking collections. Our results show the significantly higher performance of the T-PGN model when compared with other generative models. Lastly, we demonstrate that exploiting the uncertainty information of deep generative rankers opens new perspectives on query/collection understanding and significantly improves the cut-off prediction task.
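The generative paradigm's "cumulative probability of generating query terms" has a classical counterpart that makes the idea concrete: a Dirichlet-smoothed unigram query-likelihood score, shown below as a sketch. The neural rankers in the paper replace this unigram language model with a seq2seq decoder (e.g. T-PGN), but the relevance score has the same cumulative log-probability form.

```python
import math
from collections import Counter

def generative_score(query_terms, doc_terms, collection_terms, mu=10.0):
    """Relevance as the cumulative log-probability of generating the
    query from the document's language model (classical Dirichlet-
    smoothed unigram sketch; deep generative rankers swap in a neural
    decoder but keep this cumulative form).
    """
    doc_counts = Counter(doc_terms)
    coll_counts = Counter(collection_terms)
    coll_len = len(collection_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for t in query_terms:
        p_coll = coll_counts[t] / coll_len
        p = (doc_counts[t] + mu * p_coll) / (doc_len + mu)
        if p == 0:
            return float("-inf")  # term unseen anywhere: query ungenerable
        score += math.log(p)
    return score
```

Because the score is a proper (log-)probability rather than an arbitrary matching value, per-term generation probabilities give the fine-grained uncertainty signal the abstract refers to.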
In the pursuit of natural language understanding, there has been a long-standing interest in tracking state changes throughout narratives. Impressive progress has been made in modeling the state of transaction-centric dialogues and procedural texts. However, this problem has been less intensively studied in the realm of general discourse, where ground-truth descriptions of states may be loosely defined and state changes are less densely distributed over utterances. This paper proposes to turn to simplified, fully observable systems that show some of these properties: sports events. We curated 2,263 soccer matches including time-stamped natural language commentary accompanied by discrete events, such as a team scoring goals, switching players, or being penalized with cards. We propose a new task formulation where, given paragraphs of commentary of a game at different timestamps, the system is asked to recognize the occurrence of in-game events. This domain allows for rich descriptions of state while avoiding the complexities of many other real-world settings. As an initial point of performance measurement, we include two baseline methods: a sentence classifier with temporal dependence and a current state-of-the-art generative model. We demonstrate that even sophisticated existing methods struggle on the state tracking task when the definition of state broadens or non-event chatter becomes prevalent.
In any ranking system, the retrieval model outputs a single score for a document based on its belief about how relevant the document is to a given search query. While retrieval models have continued to improve with the introduction of increasingly complex architectures, few works have investigated a retrieval model's belief in the score beyond the scope of a single value. We argue that capturing the model's uncertainty with respect to its own scoring of a document is a critical aspect of retrieval that allows for greater use of current models across new document distributions and collections, and can even improve effectiveness on downstream tasks. In this paper, we address this problem via an efficient Bayesian framework for retrieval models which captures the model's belief in the relevance score through a stochastic process while adding only negligible computational overhead. We evaluate this belief via a ranking-based calibration metric, showing that our approximate Bayesian framework significantly improves a retrieval model's ranking effectiveness through risk-aware reranking, as well as its confidence calibration. Lastly, we demonstrate that this additional uncertainty information is actionable and reliable on downstream tasks, represented via cutoff prediction.
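One cheap way to obtain such a belief distribution over a relevance score is Monte Carlo dropout: repeat the forward pass with random dropout masks and report the mean score plus its spread as an uncertainty estimate. The toy linear scorer below is an illustrative stand-in for a full neural ranker; the approximate-Bayesian idea is the repeated stochastic forward pass, not this particular scoring function.

```python
import random
import statistics

def mc_dropout_score(query, doc, n_samples=100, p_drop=0.2, seed=0):
    """Approximate Bayesian scoring sketch: run the (toy) scorer many
    times with random dropout masks and return (mean, stdev) of the
    sampled scores, where the stdev serves as an uncertainty estimate.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        s = 0.0
        for qi, di in zip(query, doc):
            if rng.random() >= p_drop:        # keep this unit
                s += qi * di / (1 - p_drop)   # inverted-dropout rescaling
        samples.append(s)
    return statistics.mean(samples), statistics.stdev(samples)
```

The mean recovers roughly the deterministic score, while the spread is the actionable signal: a risk-aware reranker can, for example, penalize documents whose score distribution is wide.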