Jiafeng Guo

Pre-train a Discriminative Text Encoder for Dense Retrieval via Contrastive Span Prediction

Apr 22, 2022
Xinyu Ma, Jiafeng Guo, Ruqing Zhang, Yixing Fan, Xueqi Cheng

Dense retrieval has shown promising results in many information retrieval (IR) related tasks, whose foundation is high-quality text representation learning for effective search. Some recent studies have shown that autoencoder-based language models are able to boost the dense retrieval performance using a weak decoder. However, we argue that 1) it is not discriminative to decode all the input texts and, 2) even a weak decoder has the bypass effect on the encoder. Therefore, in this work, we introduce a novel contrastive span prediction task to pre-train the encoder alone, but still retain the bottleneck ability of the autoencoder. The key idea is to force the encoder to generate the text representation close to its own random spans while far away from others using a group-wise contrastive loss. In this way, we can 1) learn discriminative text representations efficiently with the group-wise contrastive learning over spans and, 2) avoid the bypass effect of the decoder thoroughly. Comprehensive experiments over publicly available retrieval benchmark datasets show that our approach can outperform existing pre-training methods for dense retrieval significantly.
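The key idea in the abstract above can be illustrated with a small sketch. This is not the authors' implementation: it assumes an InfoNCE-style objective over cosine similarities with a temperature `tau`, and the names `cosine` and `group_contrastive_loss` as well as the toy vectors are hypothetical. For each text, the spans sampled from that text act as positives and the spans of the other texts in the batch act as negatives.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def group_contrastive_loss(text_vecs, span_groups, tau=0.1):
    """Average InfoNCE-style loss over (text, own-span) pairs.

    text_vecs[i] is the representation of text i (assumed given);
    span_groups[i] is the list of span vectors sampled from text i.
    """
    # Flatten all spans in the batch, remembering which text owns each one.
    all_spans = [(i, s) for i, group in enumerate(span_groups) for s in group]
    total, count = 0.0, 0
    for i, text in enumerate(text_vecs):
        logits = [cosine(text, s) / tau for _, s in all_spans]
        log_denom = math.log(sum(math.exp(x) for x in logits))
        for (owner, _), logit in zip(all_spans, logits):
            if owner == i:  # a span drawn from this text is a positive
                total += log_denom - logit
                count += 1
    return total / count

# Toy batch: two texts with two spans each (illustrative values only).
texts = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[[0.9, 0.1], [0.8, 0.2]], [[0.1, 0.9], [0.2, 0.8]]]
swapped = [aligned[1], aligned[0]]
print(group_contrastive_loss(texts, aligned))
print(group_contrastive_loss(texts, swapped))
```

The loss is lower when each text representation sits near its own spans than when the span groups are swapped, which is the behaviour the pre-training task is meant to encourage.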

* Accepted to SIGIR 2022

GERE: Generative Evidence Retrieval for Fact Verification

Apr 22, 2022
Jiangui Chen, Ruqing Zhang, Jiafeng Guo, Yixing Fan, Xueqi Cheng

Fact verification (FV) is a challenging task which aims to verify a claim using multiple evidential sentences from trustworthy corpora, e.g., Wikipedia. Most existing approaches follow a three-step pipeline framework, including document retrieval, sentence retrieval and claim verification. High-quality evidence provided by the first two steps is the foundation of effective reasoning in the last step. Despite its importance, high-quality evidence is rarely studied by existing work on FV, which often adopts off-the-shelf models to retrieve relevant documents and sentences in an "index-retrieve-then-rank" fashion. This classical approach has clear drawbacks: i) a large document index as well as a complicated search process is required, leading to considerable memory and computational overhead; ii) independent scoring paradigms fail to capture the interactions among documents and sentences in ranking; iii) a fixed number of sentences is selected to form the final evidence set. In this work, we propose GERE, the first system that retrieves evidence in a generative fashion, i.e., generating the document titles as well as evidence sentence identifiers. This enables us to mitigate the aforementioned technical issues since: i) the memory and computational cost is greatly reduced because the document index is eliminated and the heavy ranking process is replaced by a light generative process; ii) the dependency between documents and that between sentences can be captured via a sequential generation process; iii) the generative formulation allows us to dynamically select a precise set of relevant evidence for each claim. The experimental results on the FEVER dataset show that GERE achieves significant improvements over the state-of-the-art baselines, with both time-efficiency and memory-efficiency.

* Accepted by SIGIR 2022

PRADA: Practical Black-Box Adversarial Attacks against Neural Ranking Models

Apr 04, 2022
Chen Wu, Ruqing Zhang, Jiafeng Guo, Maarten de Rijke, Yixing Fan, Xueqi Cheng

Neural ranking models (NRMs) have shown remarkable success in recent years, especially with pre-trained language models. However, deep neural models are notorious for their vulnerability to adversarial examples. Adversarial attacks may become a new type of web spamming technique given our increased reliance on neural information retrieval models. Therefore, it is important to study potential adversarial attacks to identify vulnerabilities of NRMs before they are deployed. In this paper, we introduce the Adversarial Document Ranking Attack (ADRA) task against NRMs, which aims to promote a target document in rankings by adding adversarial perturbations to its text. We focus on the decision-based black-box attack setting, where the attackers have no access to the model parameters and gradients, but can only acquire the rank positions of the partial retrieved list by querying the target model. This attack setting is realistic in real-world search engines. We propose a novel Pseudo Relevance-based ADversarial ranking Attack method (PRADA) that learns a surrogate model based on Pseudo Relevance Feedback (PRF) to generate gradients for finding the adversarial perturbations. Experiments on two web search benchmark datasets show that PRADA can outperform existing attack strategies and successfully fool the NRM with small indiscernible perturbations of text.

Complex Evolutional Pattern Learning for Temporal Knowledge Graph Reasoning

Mar 20, 2022
Zixuan Li, Saiping Guan, Xiaolong Jin, Weihua Peng, Yajuan Lyu, Yong Zhu, Long Bai, Wei Li, Jiafeng Guo, Xueqi Cheng

A Temporal Knowledge Graph (TKG) is a sequence of KGs corresponding to different timestamps. TKG reasoning aims to predict potential facts in the future given the historical KG sequences. One key to this task is to mine and understand evolutional patterns of facts from these sequences. The evolutional patterns are complex in two aspects: length-diversity and time-variability. Existing models for TKG reasoning focus on modeling fact sequences of a fixed length, and so cannot discover complex evolutional patterns that vary in length. Furthermore, these models are all trained offline, and so cannot adapt well to subsequent changes in evolutional patterns. Thus, we propose a new model, called Complex Evolutional Network (CEN), which uses a length-aware Convolutional Neural Network (CNN) to handle evolutional patterns of different lengths via an easy-to-difficult curriculum learning strategy. Besides, we propose to learn the model under the online setting so that it can adapt to the changes of evolutional patterns over time. Extensive experiments demonstrate that CEN obtains substantial performance improvement under both the traditional offline and the proposed online settings.

* ACL 2022 main conference

What is Event Knowledge Graph: A Survey

Dec 31, 2021
Saiping Guan, Xueqi Cheng, Long Bai, Fujun Zhang, Zixuan Li, Yutao Zeng, Xiaolong Jin, Jiafeng Guo

Besides entity-centric knowledge, usually organized as a Knowledge Graph (KG), events are also an essential kind of knowledge in the world, which has prompted the emergence of event-centric knowledge representation forms such as the Event KG (EKG). The EKG plays an increasingly important role in many machine learning and artificial intelligence applications, such as intelligent search, question-answering, recommendation, and text generation. This paper provides a comprehensive survey of EKG from history, ontology, instance, and application views. Specifically, to characterize EKG thoroughly, we focus on its history, definitions, schema induction, acquisition, related representative graphs/systems, and applications. The development processes and trends are studied therein. We further summarize prospective directions to facilitate future research on EKG.

Interpreting Dense Retrieval as Mixture of Topics

Nov 27, 2021
Jingtao Zhan, Jiaxin Mao, Yiqun Liu, Jiafeng Guo, Min Zhang, Shaoping Ma

Dense Retrieval (DR) reaches state-of-the-art results in first-stage retrieval, but little is known about the mechanisms that contribute to its success. Therefore, in this work, we conduct an interpretation study of recently proposed DR models. Specifically, we first discretize the embeddings output by the document and query encoders. Based on the discrete representations, we analyze the attribution of input tokens. Both qualitative and quantitative experiments are carried out on public test collections. Results suggest that DR models pay attention to different aspects of input and extract various high-level topic representations. Therefore, we can regard the representations learned by DR models as a mixture of high-level topics.

Pre-training Methods in Information Retrieval

Nov 27, 2021
Yixing Fan, Xiaohui Xie, Yinqiong Cai, Jia Chen, Xinyu Ma, Xiangsheng Li, Ruqing Zhang, Jiafeng Guo, Yiqun Liu

The core of information retrieval (IR) is to identify relevant information from large-scale resources and return it as a ranked list in response to a user's information need. Recently, the resurgence of deep learning has greatly advanced this field and led to a hot topic named NeuIR (i.e., neural information retrieval), especially the paradigm of pre-training methods (PTMs). Owing to sophisticated pre-training objectives and huge model size, pre-trained models can learn universal language representations from massive textual data, which are beneficial to the ranking task of IR. Since a large number of works have been dedicated to the application of PTMs in IR, we believe it is the right time to summarize the current status, learn from existing methods, and gain some insights for future development. In this survey, we present an overview of PTMs applied in different components of an IR system, including the retrieval component, the re-ranking component, and other components. In addition, we introduce PTMs specifically designed for IR, and summarize available datasets as well as benchmark leaderboards. Moreover, we discuss some open challenges and envision some promising directions, with the hope of inspiring more work on these topics in future research.

Learning Discrete Representations via Constrained Clustering for Effective and Efficient Dense Retrieval

Oct 12, 2021
Jingtao Zhan, Jiaxin Mao, Yiqun Liu, Jiafeng Guo, Min Zhang, Shaoping Ma

Dense Retrieval (DR) has achieved state-of-the-art first-stage ranking effectiveness. However, the efficiency of most existing DR models is limited by the large memory cost of storing dense vectors and the time-consuming nearest neighbor search (NNS) in vector space. Therefore, we present RepCONC, a novel retrieval model that learns discrete Representations via CONstrained Clustering. RepCONC jointly trains dual-encoders and the Product Quantization (PQ) method to learn discrete document representations and enables fast approximate NNS with compact indexes. It models quantization as a constrained clustering process, which requires the document embeddings to be uniformly clustered around the quantization centroids and supports end-to-end optimization of the quantization method and dual-encoders. We theoretically demonstrate the importance of the uniform clustering constraint in RepCONC and derive an efficient approximate solution for constrained clustering by reducing it to an instance of the optimal transport problem. Besides constrained clustering, RepCONC further adopts a vector-based inverted file system (IVF) to support highly efficient vector search on CPUs. Extensive experiments on two popular ad-hoc retrieval benchmarks show that RepCONC achieves better ranking effectiveness than competitive vector quantization baselines under different compression ratio settings. It also substantially outperforms a wide range of existing retrieval models in terms of retrieval effectiveness, memory efficiency, and time efficiency.
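To illustrate how a uniform clustering constraint can be cast as an optimal transport problem, here is a minimal sketch using Sinkhorn-Knopp iterations on an entropy-regularized transport plan. The choice of solver is an assumption made for illustration: the abstract says only that constrained clustering is reduced to an instance of optimal transport. The function name `balanced_assign`, the regularization strength `epsilon`, and the toy data are hypothetical.

```python
import math

def balanced_assign(cost, epsilon=0.5, n_iters=200):
    """Soft-assign n points to k centroids with near-uniform cluster sizes.

    cost[i][j] is the distance from point i to centroid j. The returned
    plan has rows summing to 1/n and columns summing to about 1/k.
    """
    n, k = len(cost), len(cost[0])
    # Gibbs kernel of the entropy-regularized transport problem.
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * k
    for _ in range(n_iters):
        # Rescale columns so every centroid receives total mass 1/k.
        v = [(1.0 / k) / sum(kernel[i][j] * u[i] for i in range(n))
             for j in range(k)]
        # Rescale rows so every point sends out total mass 1/n.
        u = [(1.0 / n) / sum(kernel[i][j] * v[j] for j in range(k))
             for i in range(n)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(k)] for i in range(n)]

# Toy 1-D data: three points lie near centroid 0.0 and one near 1.0, so a
# plain nearest-centroid assignment would give unbalanced clusters.
points = [0.0, 0.1, 0.2, 0.9]
centroids = [0.0, 1.0]
cost = [[(p - c) ** 2 for c in centroids] for p in points]
plan = balanced_assign(cost)
column_mass = [sum(row[j] for row in plan) for j in range(len(centroids))]
print(column_mass)
```

In this sketch the column sums of the plan approach an equal share per centroid, which is the sense in which the assignment is "uniform"; a smaller `epsilon` would give a sharper plan, typically at the cost of slower convergence.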

* WSDM 2022
