Legal case retrieval is a special Information Retrieval~(IR) task focusing on legal case documents. Depending on the downstream tasks of the retrieved case documents, users' information needs in legal case retrieval could be significantly different from those in Web search and traditional ad-hoc retrieval tasks. While several studies retrieve legal cases based on text similarity, the underlying search intents of legal retrieval users, as shown in this paper, are more complicated than text similarity alone can capture and remain mostly unexplored. To this end, we present a novel hierarchical intent taxonomy of legal case retrieval. It consists of five intent types categorized by three criteria, i.e., search for Particular Case(s), Characterization, Penalty, Procedure, and Interest. The taxonomy was constructed transparently and evaluated extensively through interviews, editorial user studies, and query log analysis. Through a laboratory user study, we reveal significant differences in user behavior and satisfaction under different search intents in legal case retrieval. Furthermore, we apply the proposed taxonomy to various downstream legal retrieval tasks, e.g., result ranking and satisfaction prediction, and demonstrate its effectiveness. Our work provides important insights into user intents in legal case retrieval and potentially leads to better retrieval techniques in the legal domain, such as intent-aware ranking strategies and evaluation methodologies.
Recent studies have shown that Dense Retrieval (DR) techniques can significantly improve the performance of first-stage retrieval in IR systems. Despite its empirical effectiveness, the application of DR is still limited. In contrast to statistical retrieval models that rely on highly efficient inverted index solutions, DR models build dense embeddings that are difficult to pre-process with most existing search indexing systems. To avoid the expensive cost of brute-force search, Approximate Nearest Neighbor (ANN) algorithms and their corresponding indexes are widely applied to speed up the inference process of DR models. Unfortunately, while ANN can improve the efficiency of DR models, it usually comes at a significant cost in retrieval performance. To solve this issue, we propose JTR, which stands for Joint optimization of TRee-based index and query encoding. Specifically, we design a new unified contrastive learning loss to train the tree-based index and the query encoder in an end-to-end manner. A tree-based negative sampling strategy is applied to give the tree the max-heap property, which supports effective beam search. Moreover, we treat the cluster assignment as an optimization problem and update the tree-based index in a way that allows overlapped clustering. We evaluate JTR on numerous popular retrieval benchmarks. Experimental results show that JTR achieves better retrieval performance while retaining high system efficiency compared with widely-adopted baselines. It provides a potential solution to balance efficiency and effectiveness in neural retrieval system designs.
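As a rough illustration of how beam search retrieves candidates from a tree-based index, the following minimal sketch descends a small hand-built tree level by level and keeps the best-scoring nodes at each level. The `Node` class, the inner-product scoring, and the toy embeddings are assumptions made for illustration; this is not the JTR implementation, and no training is shown.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical tree node: leaves hold doc ids, internal nodes hold children."""
    embedding: list[float]
    children: list[Node] = field(default_factory=list)
    doc_ids: list[int] = field(default_factory=list)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def beam_search(root, query, beam_width):
    """Descend level by level, keeping the beam_width best-scoring nodes."""
    beam = [root]
    while any(node.children for node in beam):
        # Expand internal nodes; leaves already in the beam are carried over.
        candidates = []
        for node in beam:
            candidates.extend(node.children if node.children else [node])
        candidates.sort(key=lambda n: dot(query, n.embedding), reverse=True)
        beam = candidates[:beam_width]
    # Collect documents from the surviving leaves. A document may appear in
    # more than one leaf when clusters overlap, so duplicates are dropped.
    seen, results = set(), []
    for leaf in beam:
        for doc_id in leaf.doc_ids:
            if doc_id not in seen:
                seen.add(doc_id)
                results.append(doc_id)
    return results


# Toy tree (assumed values). Document 2 belongs to two overlapping clusters.
leaf_a = Node([1.0, 0.0], doc_ids=[1, 2])
leaf_b = Node([0.8, 0.2], doc_ids=[2, 3])
leaf_c = Node([0.0, 1.0], doc_ids=[4])
leaf_d = Node([0.1, 0.9], doc_ids=[5])
left = Node([0.9, 0.1], children=[leaf_a, leaf_b])
right = Node([0.05, 0.95], children=[leaf_c, leaf_d])
root = Node([0.5, 0.5], children=[left, right])

narrow = beam_search(root, [1.0, 0.0], beam_width=1)
wide = beam_search(root, [1.0, 0.0], beam_width=2)
```

A wider beam visits more leaves and therefore returns more candidates at a higher cost. If a parent always scores at least as high as its children (the max-heap property), pruning a low-scoring parent cannot discard a high-scoring leaf, which is why that property matters for beam search.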
Recent advances in Dense Retrieval (DR) techniques have significantly improved the effectiveness of first-stage retrieval. Trained with large-scale supervised data, DR models can encode queries and documents into a low-dimensional dense space and conduct effective semantic matching. However, previous studies have shown that the effectiveness of DR models drops by a large margin when the trained models are applied to a target domain that differs from the domain of the labeled data. One possible reason is that the DR model has never seen the target corpus and thus might be incapable of mitigating the difference between the training and target domains. In practice, unfortunately, training a DR model for each target domain to avoid domain shift is often difficult, as it requires additional time, storage, and domain-specific data labeling, which are not always available. To address this problem, in this paper, we propose a novel DR framework named Disentangled Dense Retrieval (DDR) to support effective and flexible domain adaptation for DR models. DDR consists of a Relevance Estimation Module (REM) for modeling domain-invariant matching patterns and several Domain Adaption Modules (DAMs) for modeling domain-specific features of multiple target corpora. By disentangling the REM and the DAMs, DDR enables a flexible training paradigm in which the REM is trained with supervision once and the DAMs are trained with unsupervised data. Comprehensive experiments in different domains and languages show that DDR significantly improves ranking performance compared to strong DR baselines and substantially outperforms traditional retrieval methods in most scenarios.
A retrieval model should not only interpolate the training data but also extrapolate well to queries that differ substantially from the training data. While dense retrieval (DR) models have been demonstrated to achieve better retrieval performance than traditional term-based retrieval models, we still know little about whether they can extrapolate. To shed light on this research question, we investigate how DR models perform in both the interpolation and extrapolation regimes. We first investigate the distribution of training and test data on popular retrieval benchmarks and identify a considerable overlap in query entities, query intent, and relevance labels. This finding implies that the performance on these test sets is biased towards interpolation and cannot accurately reflect the extrapolation capacity. Therefore, to evaluate the extrapolation performance of DR models, we propose two resampling strategies for existing retrieval benchmarks and comprehensively investigate how DR models perform. Results show that DR models may interpolate as well as complex interaction-based models (e.g., BERT and ColBERT) but extrapolate substantially worse. Among various DR training strategies, text-encoding pretraining and target-domain pretraining are particularly effective for improving the extrapolation capacity. Finally, we compare the extrapolation capacity with domain transfer ability. Although our evaluation approach is simple and easy to use, the extrapolation performance it measures can reflect the domain transfer ability in some domains of the BEIR dataset, further highlighting the feasibility of our approaches in evaluating the generalizability of DR models.
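To make the interpolation/extrapolation distinction concrete, the sketch below partitions training queries by whether they share an entity with any test query. It only illustrates the general idea of resampling by overlap; it is not one of the two resampling strategies proposed in the abstract above, and the function name, the `(text, entities)` query format, and the toy queries are all assumptions.

```python
def split_by_entity_overlap(train_queries, test_queries):
    """Partition training queries by entity overlap with the test queries.

    Each query is a (text, entities) pair, where entities is a set of strings.
    Returns (overlapping, disjoint) lists of training query texts.
    """
    test_entities = set()
    for _, entities in test_queries:
        test_entities |= entities

    overlapping, disjoint = [], []
    for text, entities in train_queries:
        if entities & test_entities:
            overlapping.append(text)
        else:
            disjoint.append(text)
    return overlapping, disjoint


# Hypothetical toy queries with pre-extracted entities.
train = [
    ("capital of france", {"france"}),
    ("height of everest", {"everest"}),
    ("paris population", {"paris"}),
]
test = [
    ("france gdp", {"france"}),
    ("paris weather", {"paris"}),
]

overlapping, disjoint = split_by_entity_overlap(train, test)
```

Training only on the `disjoint` subset and evaluating on the test queries probes an extrapolation-like regime, whereas training on the `overlapping` subset stays closer to interpolation.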
A proactive dialogue system can lead the conversation to a goal topic and has great potential in bargaining, persuasion, and negotiation. The current corpus-based learning paradigm limits its practical application in real-world scenarios. To this end, we advance the study of proactive dialogue policy to a more natural and challenging setting, i.e., interacting dynamically with users. Further, we call attention to non-cooperative user behavior -- the user talks about off-path topics when he/she is not satisfied with the previous topics introduced by the agent. We argue that the targets of reaching the goal topic quickly and maintaining high user satisfaction do not always converge, because the topics close to the goal and the topics the user prefers may not be the same. To address this issue, we propose a new solution named I-Pro that can learn a Proactive policy in the Interactive setting. Specifically, we learn the trade-off via a learned goal weight, which consists of four factors (dialogue turn, goal completion difficulty, user satisfaction estimation, and cooperative degree). The experimental results demonstrate that I-Pro significantly outperforms baselines in terms of effectiveness and interpretability.
Recommender systems are usually developed and evaluated on historical user-item logs. However, most offline recommendation datasets are highly sparse and contain various biases, which hampers the evaluation of recommendation policies. Existing efforts aim to improve the data quality by collecting users' preferences on randomly selected items (e.g., Yahoo! and Coat). However, they still suffer from the high variance caused by sparsely observed data. To fundamentally solve the problem, we present KuaiRec, a fully-observed dataset collected from the social video-sharing mobile App, Kuaishou. The feedback of 1,411 users on almost all of the 3,327 videos is explicitly observed. To the best of our knowledge, this is the first real-world fully-observed dataset with millions of user-item interactions in recommendation. To demonstrate the advantage of KuaiRec, we leverage it to explore the key questions in evaluating conversational recommender systems. The experimental results show that two factors in traditional partially-observed data -- the data density and the exposure bias -- greatly affect the evaluation results. This highlights the significance of our fully-observed data for research in many directions in recommender systems, e.g., unbiased recommendation, interactive/conversational recommendation, and evaluation. We release the dataset and the pipeline implementation for evaluation at https://chongminggao.github.io/KuaiRec/.
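For orientation, data density is simply the fraction of user-item cells with observed feedback. The short sketch below computes the size of the full matrix implied by the user and video counts stated above and compares it with a hypothetical sparse log; the 50,000-interaction figure is an assumption chosen only for illustration and is not taken from any dataset.

```python
def density(num_observed, num_users, num_items):
    """Fraction of the user-item matrix that has observed feedback."""
    return num_observed / (num_users * num_items)


# Sizes stated in the text: 1,411 users and 3,327 videos.
full_cells = 1411 * 3327  # cells in the complete user-item matrix

# Hypothetical partially-observed log over the same users and items.
sparse_density = density(50_000, 1411, 3327)
```

A fully-observed matrix of this size has about 4.7 million cells, which is consistent with the "millions of user-item interactions" mentioned above, while the hypothetical sparse log covers only about 1% of them.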
Dense Retrieval (DR) reaches state-of-the-art results in first-stage retrieval, but little is known about the mechanisms that contribute to its success. Therefore, in this work, we conduct an interpretation study of recently proposed DR models. Specifically, we first discretize the embeddings output by the document and query encoders. Based on the discrete representations, we analyze the attribution of input tokens. Both qualitative and quantitative experiments are carried out on public test collections. Results suggest that DR models pay attention to different aspects of input and extract various high-level topic representations. Therefore, we can regard the representations learned by DR models as a mixture of high-level topics.
Dense Retrieval (DR) has achieved state-of-the-art first-stage ranking effectiveness. However, the efficiency of most existing DR models is limited by the large memory cost of storing dense vectors and the time-consuming nearest neighbor search (NNS) in vector space. Therefore, we present RepCONC, a novel retrieval model that learns discrete Representations via CONstrained Clustering. RepCONC jointly trains dual-encoders and the Product Quantization (PQ) method to learn discrete document representations and enables fast approximate NNS with compact indexes. It models quantization as a constrained clustering process, which requires the document embeddings to be uniformly clustered around the quantization centroids and supports end-to-end optimization of the quantization method and dual-encoders. We theoretically demonstrate the importance of the uniform clustering constraint in RepCONC and derive an efficient approximate solution for constrained clustering by reducing it to an instance of the optimal transport problem. Besides constrained clustering, RepCONC further adopts a vector-based inverted file system (IVF) to support highly efficient vector search on CPUs. Extensive experiments on two popular ad-hoc retrieval benchmarks show that RepCONC achieves better ranking effectiveness than competitive vector quantization baselines under different compression ratio settings. It also substantially outperforms a wide range of existing retrieval models in terms of retrieval effectiveness, memory efficiency, and time efficiency.
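The link between uniform clustering and optimal transport can be illustrated with a small entropic-regularized example solved by Sinkhorn-Knopp iterations. This is a sketch under assumptions: the abstract does not state which solver RepCONC uses, and the function name, the regularization strength `epsilon`, and the toy score matrix are all hypothetical.

```python
import math


def sinkhorn_plan(scores, epsilon=0.5, iterations=100):
    """Soft assignment of n documents to k centroids with uniform marginals.

    scores[i][j] is the affinity between document i and centroid j.
    Each row of the returned plan sums to about 1/n and each column to 1/k.
    """
    n, k = len(scores), len(scores[0])
    kernel = [[math.exp(s / epsilon) for s in row] for row in scores]
    u, v = [1.0] * n, [1.0] * k
    for _ in range(iterations):
        # Rescale rows so each document carries mass 1/n ...
        u = [(1.0 / n) / sum(kernel[i][j] * v[j] for j in range(k)) for i in range(n)]
        # ... then rescale columns so each centroid receives mass 1/k.
        v = [(1.0 / k) / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(k)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(k)] for i in range(n)]


# Toy affinities (assumed): three of four documents prefer centroid 0.
scores = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.2, 0.8]]

greedy = [max(range(2), key=lambda j: row[j]) for row in scores]
plan = sinkhorn_plan(scores)
column_mass = [sum(row[j] for row in plan) for j in range(2)]
row_mass = [sum(row) for row in plan]
```

Greedy nearest-centroid assignment sends three documents to centroid 0, whereas the transport plan gives both centroids the same total mass. A smaller `epsilon` yields a sharper plan at the cost of slower convergence.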
Conversational search systems, such as Google Assistant and Microsoft Cortana, provide a new search paradigm where users are allowed, via natural language dialogues, to communicate with search systems. Evaluating such systems is very challenging since search results are presented in the format of natural language sentences. Given the unlimited number of possible responses, collecting relevance assessments for all the possible responses is infeasible. In this paper, we propose POSSCORE, a simple yet effective automatic evaluation method for conversational search. The proposed embedding-based metric takes into account the influence of the part of speech (POS) of the terms in the response. To the best of our knowledge, our work is the first to systematically demonstrate the importance of incorporating syntactic information, such as POS labels, for conversational search evaluation. Experimental results demonstrate that our metrics correlate with human preference, achieving significant improvements over state-of-the-art baseline metrics.
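One simple way to let part-of-speech information influence an embedding-based metric is to weight each term's embedding by its POS tag before comparing a response with a reference. The sketch below shows that idea only; it is not the POSSCORE formula, and the function names, the POS weights, and the two-dimensional toy embeddings are all assumptions.

```python
import math


def weighted_embedding(tagged_tokens, embeddings, pos_weights, default_weight=1.0):
    """Weighted mean of token embeddings, weighting each token by its POS tag."""
    dim = len(next(iter(embeddings.values())))
    total, weight_sum = [0.0] * dim, 0.0
    for token, pos in tagged_tokens:
        if token not in embeddings:
            continue  # skip out-of-vocabulary tokens
        w = pos_weights.get(pos, default_weight)
        total = [t + w * x for t, x in zip(total, embeddings[token])]
        weight_sum += w
    return [t / weight_sum for t in total] if weight_sum else total


def cosine(a, b):
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


# Toy embeddings and POS-tagged sentences (assumed values).
embeddings = {"the": [1.0, 0.0], "flight": [0.0, 1.0]}
response = [("the", "DET"), ("flight", "NOUN")]
reference = [("flight", "NOUN")]

uniform = {}                          # every POS tag gets the default weight
pos_aware = {"DET": 0.1, "NOUN": 1.0}  # hypothetical weights favoring nouns

ref_vec = weighted_embedding(reference, embeddings, uniform)
uniform_score = cosine(weighted_embedding(response, embeddings, uniform), ref_vec)
pos_score = cosine(weighted_embedding(response, embeddings, pos_aware), ref_vec)
```

With uniform weights the determiner pulls the response vector away from the reference, while down-weighting it makes the score depend mostly on the content word.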
Recently, the Information Retrieval community has witnessed fast-paced advances in Dense Retrieval (DR), which performs first-stage retrieval with embedding-based search. Despite the impressive ranking performance, previous studies usually adopt brute-force search to acquire candidates, which is prohibitive in practical Web search scenarios due to its tremendous memory usage and time cost. To overcome these problems, vector compression methods have been adopted in many practical embedding-based retrieval applications. One of the most popular methods is Product Quantization (PQ). However, although existing vector compression methods including PQ can help improve the efficiency of DR, they severely degrade retrieval performance due to the separation between encoding and compression. To tackle this problem, we present JPQ, which stands for Joint optimization of query encoding and Product Quantization. It trains the query encoder and PQ index jointly in an end-to-end manner based on three optimization strategies, namely ranking-oriented loss, PQ centroid optimization, and end-to-end negative sampling. We evaluate JPQ on two publicly available retrieval benchmarks. Experimental results show that JPQ significantly outperforms popular vector compression methods. Compared with previous DR models that use brute-force search, JPQ almost matches the best retrieval performance with 30x compression on index size. The compressed index further brings 10x speedup on CPU and 2x speedup on GPU in query latency.
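For readers unfamiliar with Product Quantization, the following sketch shows plain PQ encoding and lookup-table scoring on a tiny hand-written codebook, plus the compression-ratio arithmetic. It illustrates standard PQ only, not JPQ's joint training; the codebooks, vectors, and the 768-dimensional, 96-sub-vector configuration are assumptions chosen for illustration.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pq_encode(vector, codebooks):
    """Replace each sub-vector with the index of its nearest centroid."""
    sub_dim = len(vector) // len(codebooks)
    codes = []
    for i, codebook in enumerate(codebooks):
        sub = vector[i * sub_dim:(i + 1) * sub_dim]
        codes.append(min(range(len(codebook)), key=lambda c: sq_dist(sub, codebook[c])))
    return codes


def pq_score(query, codes, codebooks):
    """Approximate inner product using one lookup table per sub-space."""
    sub_dim = len(query) // len(codebooks)
    tables = [
        [dot(query[i * sub_dim:(i + 1) * sub_dim], centroid) for centroid in codebook]
        for i, codebook in enumerate(codebooks)
    ]
    return sum(tables[i][code] for i, code in enumerate(codes))


def compression_ratio(dim, num_subvectors, bits_per_code=8, bytes_per_float=4):
    """Size of the raw float vector divided by the size of its PQ codes."""
    return (dim * bytes_per_float) / (num_subvectors * bits_per_code / 8)


# Toy setup (assumed): 4-dim vectors, 2 sub-spaces, 2 centroids per sub-space.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
doc = [0.9, 1.1, 0.1, 0.8]
query = [1.0, 0.0, 0.0, 2.0]

codes = pq_encode(doc, codebooks)
approx = pq_score(query, codes, codebooks)
exact = dot(query, doc)
ratio = compression_ratio(dim=768, num_subvectors=96)  # hypothetical configuration
```

The approximate score differs from the exact inner product because the document is represented by centroids rather than its original values. When the centroids are fit without regard to ranking, this error is one source of the performance loss that the abstract attributes to separating encoding from compression.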