Chenyan Xiong

Microsoft Research

Distributionally Robust Unsupervised Dense Retrieval Training on Web Graphs

Oct 26, 2023
Peixuan Han, Zhenghao Liu, Zhiyuan Liu, Chenyan Xiong

This paper introduces Web-DRO, an unsupervised dense retrieval model that clusters documents based on web structures and reweights the clusters during contrastive training. Specifically, we first leverage web graph links and contrastively train an embedding model for clustering anchor-document pairs. Then we use Group Distributionally Robust Optimization to reweight different clusters of anchor-document pairs, which guides the model to assign more weight to groups with higher contrastive loss and pay more attention to the worst cases during training. Our experiments on MS MARCO and BEIR show that our model, Web-DRO, significantly improves retrieval effectiveness in unsupervised scenarios. A comparison of clustering techniques shows that training on the web graph combined with URL information achieves the best clustering performance. Further analysis confirms that the group weights are stable and valid, indicating consistent model preferences as well as effective up-weighting of valuable groups and down-weighting of uninformative ones. The code for this paper is available at https://github.com/OpenMatch/Web-DRO.
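
A minimal sketch of the group reweighting idea, assuming the standard Group DRO exponentiated-gradient update; the function names, loop structure, and step size below are illustrative, not the paper's implementation.

```python
import torch

def group_dro_step(per_example_loss, group_ids, group_weights, eta=0.01):
    """One Group DRO step: up-weight clusters with higher contrastive loss."""
    num_groups = group_weights.numel()
    group_losses = torch.zeros(num_groups, device=per_example_loss.device)
    for g in range(num_groups):
        mask = group_ids == g
        if mask.any():
            # Mean contrastive loss of this cluster in the current batch.
            group_losses[g] = per_example_loss[mask].mean()
    # Exponentiated-gradient update: worse-off groups gain weight.
    group_weights = group_weights * torch.exp(eta * group_losses.detach())
    group_weights = group_weights / group_weights.sum()
    # Train the retriever on the weighted sum of group losses.
    return (group_weights * group_losses).sum(), group_weights
```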

* 9 pages, 5 figures, 5 tables 

Unlock Multi-Modal Capability of Dense Retrieval via Visual Module Plugin

Oct 21, 2023
Tianshuo Zhou, Sen Mei, Xinze Li, Zhenghao Liu, Chenyan Xiong, Zhiyuan Liu, Yu Gu, Ge Yu

This paper proposes the Multi-modAl Retrieval model via Visual modulE pLugin (MARVEL), which learns an embedding space for queries and multi-modal documents to conduct retrieval. MARVEL encodes queries and multi-modal documents with a unified encoder model, which helps alleviate the modality gap between images and texts. Specifically, we enable the image understanding ability of a well-trained dense retriever, T5-ANCE, by incorporating the image features encoded by the visual module as its inputs. To facilitate multi-modal retrieval tasks, we build the ClueWeb22-MM dataset based on the ClueWeb22 dataset, which regards anchor texts as queries and extracts the related text and image documents from the anchor-linked web pages. Our experiments show that MARVEL significantly outperforms state-of-the-art methods on the multi-modal retrieval datasets WebQA and ClueWeb22-MM. Our further analyses show that the visual module plugin method is an effective way to equip an existing dense retrieval model with image understanding ability. We also show that the language model can extract image semantics from image encoders and adapt the image features into the input space of language models. All code is available at https://github.com/OpenMatch/MARVEL.
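
A hedged sketch of the plugin idea, under the assumption that image features are linearly projected into the text encoder's token-embedding space and prepended as extra input tokens; the module name, dimensions, and token count are illustrative.

```python
import torch
import torch.nn as nn

class VisualPlugin(nn.Module):
    """Project visual features into a text encoder's input embedding space,
    so image content enters the model as a few extra input tokens."""
    def __init__(self, visual_dim=768, text_dim=768, num_prefix_tokens=4):
        super().__init__()
        self.proj = nn.Linear(visual_dim, text_dim * num_prefix_tokens)
        self.num_prefix_tokens = num_prefix_tokens
        self.text_dim = text_dim

    def forward(self, image_features, token_embeddings):
        # image_features: (batch, visual_dim)
        # token_embeddings: (batch, seq_len, text_dim)
        prefix = self.proj(image_features).view(
            -1, self.num_prefix_tokens, self.text_dim)
        # Prepend projected image tokens to the text token embeddings.
        return torch.cat([prefix, token_embeddings], dim=1)
```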

Toolink: Linking Toolkit Creation and Using through Chain-of-Solving on Open-Source Model

Oct 08, 2023
Cheng Qian, Chenyan Xiong, Zhenghao Liu, Zhiyuan Liu

Large Language Models (LLMs) have demonstrated remarkable progress in utilizing tools, but their closed-source nature and high inference costs limit their adaptability, motivating methods that leverage smaller, open-source models. In this paper, we introduce Toolink, a comprehensive framework that performs task solving by first creating a toolkit and then integrating the planning and calling of tools through a chain-of-solving (CoS) approach. We first validate the efficacy of Toolink in harnessing the model's creativity and CoS ability on ChatGPT. Subsequently, we curate CoS-GPT, a chain-of-solving dataset designed for tool use, and fine-tune the LLaMA-7B model. This yields LLaMA-CoS, a powerful open-source model with advanced tool-planning and tool-calling capabilities. Evaluation on diverse tasks from BIG-bench demonstrates that its CoS ability matches that of ChatGPT while its performance surpasses the chain-of-thought approach. Further studies highlight the generalization of LLaMA-CoS to unseen tasks and showcase its capability to use toolkits not explicitly tailored for the target task, affirming its robustness in real-world scenarios. All code and data are released.
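
A hypothetical two-stage pipeline in the spirit of toolkit creation followed by chain-of-solving; `generate` stands in for any LM completion call, and the prompts are invented for illustration rather than taken from the paper.

```python
def chain_of_solving(task, generate):
    """Two-stage pipeline: create a toolkit, then plan and call the tools."""
    # Stage 1: toolkit creation.
    toolkit = generate(
        f"Task: {task}\n"
        "Write a small toolkit of Python functions useful for this task."
    )
    # Stage 2: chain-of-solving -- plan tool calls, then execute the plan.
    plan = generate(
        f"Task: {task}\nToolkit:\n{toolkit}\n"
        "Plan a sequence of tool calls that solves the task."
    )
    return generate(
        f"Task: {task}\nToolkit:\n{toolkit}\nPlan:\n{plan}\n"
        "Carry out the plan step by step and state the final answer."
    )
```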

Text Matching Improves Sequential Recommendation by Reducing Popularity Biases

Aug 27, 2023
Zhenghao Liu, Sen Mei, Chenyan Xiong, Xiaohua Li, Shi Yu, Zhiyuan Liu, Yu Gu, Ge Yu

This paper proposes the Text mAtching based SequenTial rEcommendation model (TASTE), which maps items and users into an embedding space and recommends items by matching their text representations. TASTE verbalizes items and user-item interactions using the identifiers and attributes of items. To better characterize user behaviors, TASTE additionally proposes an attention sparsity method, which enables TASTE to model longer user-item interactions by reducing the self-attention computation during encoding. Our experiments show that TASTE outperforms state-of-the-art methods on widely used sequential recommendation datasets. TASTE alleviates the cold-start problem by representing long-tail items using full-text modeling and brings the benefits of pretrained language models to recommendation systems. Our further analyses illustrate that TASTE significantly improves recommendation accuracy by reducing the popularity bias of previous item-ID-based recommendation models and returning more appropriate, text-relevant items that satisfy users. All code is available at https://github.com/OpenMatch/TASTE.
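
An illustrative sketch of the text-matching recommendation idea, assuming items are verbalized from their attributes and ranked by embedding similarity; `encode` stands in for any text encoder returning normalized embeddings, and the formatting choices are invented.

```python
import torch

def verbalize_item(item):
    # e.g. {"id": 42, "title": "...", "brand": "..."}
    #   -> "id: 42, title: ..., brand: ..."
    return ", ".join(f"{k}: {v}" for k, v in item.items())

def recommend(user_history, candidate_items, encode):
    """Rank candidates by matching the verbalized history against items."""
    user_text = " ; ".join(verbalize_item(i) for i in user_history)
    user_emb = encode([user_text])                                   # (1, d)
    item_embs = encode([verbalize_item(i) for i in candidate_items]) # (n, d)
    scores = item_embs @ user_emb.T                                  # (n, 1)
    return torch.argsort(scores.squeeze(1), descending=True)
```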

* Accepted by CIKM 2023 

Improving Multitask Retrieval by Promoting Task Specialization

Jul 01, 2023
Wenzheng Zhang, Chenyan Xiong, Karl Stratos, Arnold Overwijk

In multitask retrieval, a single retriever is trained to retrieve relevant contexts for multiple tasks. Despite its practical appeal, naive multitask retrieval lags behind task-specific retrieval in which a separate retriever is trained for each task. We show that it is possible to train a multitask retriever that outperforms task-specific retrievers by promoting task specialization. The main ingredients are: (1) a better choice of pretrained model (one that is explicitly optimized for multitasking) along with compatible prompting, and (2) a novel adaptive learning method that encourages each parameter to specialize in a particular task. The resulting multitask retriever is highly performant on the KILT benchmark. Upon analysis, we find that the model indeed learns parameters that are more task-specialized compared to naive multitasking without prompting or adaptive learning.
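
As an illustration of the compatible-prompting ingredient, a multitask retriever might prepend a task-specific instruction to each query so that queries from different tasks are encoded in task-specialized context; the prefixes below are invented examples, not the paper's prompts.

```python
# Illustrative task-specific instruction prefixes for query encoding.
TASK_PROMPTS = {
    "qa":         "Retrieve a passage that answers the question: ",
    "fact_check": "Retrieve evidence for or against the claim: ",
    "entity":     "Retrieve the page describing the entity: ",
}

def encode_query(task, query, encode):
    """Encode a query with its task's instruction prefix prepended."""
    return encode(TASK_PROMPTS[task] + query)
```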

* TACL 2023 

Structure-Aware Language Model Pretraining Improves Dense Retrieval on Structured Data

May 31, 2023
Xinze Li, Zhenghao Liu, Chenyan Xiong, Shi Yu, Yu Gu, Zhiyuan Liu, Ge Yu

This paper presents the Structure Aware Dense Retrieval (SANTA) model, which encodes user queries and structured data in one universal embedding space for retrieving structured data. SANTA proposes two pretraining methods to make language models structure-aware and to learn effective representations for structured data: 1) Structured Data Alignment, which utilizes the natural alignment relations between structured and unstructured data for structure-aware pretraining. It contrastively trains language models to represent multi-modal text data and teaches models to distinguish the matched structured data for unstructured texts. 2) Masked Entity Prediction, which designs an entity-oriented masking strategy and asks language models to fill in the masked entities. Our experiments show that SANTA achieves state-of-the-art performance on code search and product search and produces convincing results in the zero-shot setting. SANTA learns tailored representations for multi-modal text data by aligning structured and unstructured data pairs and captures structural semantics by masking and predicting entities in the structured data. All code is available at https://github.com/OpenMatch/OpenMatch.
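
A hedged sketch of the structured-data alignment objective, assuming a standard in-batch contrastive (InfoNCE-style) loss over matched structured/unstructured pairs; the temperature and function names are illustrative.

```python
import torch
import torch.nn.functional as F

def alignment_loss(text_embs, struct_embs, temperature=0.05):
    """Align each unstructured text with its matched structured datum,
    treating other in-batch structured data as negatives."""
    # text_embs, struct_embs: (batch, dim); row i of each is a matched pair.
    text_embs = F.normalize(text_embs, dim=-1)
    struct_embs = F.normalize(struct_embs, dim=-1)
    logits = text_embs @ struct_embs.T / temperature  # (batch, batch)
    labels = torch.arange(logits.size(0), device=logits.device)
    # Each text should score its own structured datum above the negatives.
    return F.cross_entropy(logits, labels)
```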

* Accepted by Findings of ACL 2023 

Augmentation-Adapted Retriever Improves Generalization of Language Models as Generic Plug-In

May 27, 2023
Zichun Yu, Chenyan Xiong, Shi Yu, Zhiyuan Liu

Retrieval augmentation can aid language models (LMs) in knowledge-intensive tasks by supplying them with external information. Prior works on retrieval augmentation usually fine-tune the retriever and the LM jointly, making them closely coupled. In this paper, we explore the scheme of the generic retrieval plug-in: the retriever assists target LMs that may not be known beforehand or cannot be fine-tuned jointly. To retrieve useful documents for unseen target LMs, we propose the augmentation-adapted retriever (AAR), which learns LM preferences from a known source LM. Experiments on the MMLU and PopQA datasets demonstrate that AAR, trained with a small source LM, significantly improves the zero-shot generalization of larger target LMs ranging from the 250M-parameter Flan-T5 to the 175B-parameter InstructGPT. Further analysis indicates that the preferences of different LMs overlap, enabling an AAR trained with a single source LM to serve as a generic plug-in for various target LMs. Our code is open-sourced at https://github.com/OpenMatch/Augmentation-Adapted-Retriever.
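
A minimal sketch of the generic plug-in setting: the retriever supplies evidence to a frozen, possibly unknown target LM purely through its input prompt. `retrieve` and `generate` stand in for the retriever and the target LM; the prompt format is an invented example.

```python
def answer_with_plugin(question, retrieve, generate, k=3):
    """Retrieval plug-in: augment a frozen target LM via its prompt only."""
    docs = retrieve(question, top_k=k)   # the retriever is the plug-in
    context = "\n".join(docs)
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return generate(prompt)              # the target LM is never fine-tuned
```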

* Accepted to ACL 2023 

Fusion-in-T5: Unifying Document Ranking Signals for Improved Information Retrieval

May 24, 2023
Shi Yu, Chenghao Fan, Chenyan Xiong, David Jin, Zhiyuan Liu, Zhenghao Liu

Common IR pipelines are typically cascade systems that may involve multiple rankers and/or fusion models to integrate different information step by step. In this paper, we propose a novel re-ranker named Fusion-in-T5 (FiT5), which integrates document text information, retrieval features, and global document information into a single unified model using template-based input and global attention. Experiments on the passage ranking benchmarks MS MARCO and TREC DL show that FiT5 significantly improves ranking performance over prior pipelines. Analyses find that, through global attention, FiT5 is able to jointly utilize the ranking features by gradually attending to related documents, thus improving the detection of subtle nuances between them. Our code will be open-sourced.
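
An illustrative version of a template-based re-ranker input that verbalizes a retrieval feature alongside the query and document text; the template wording, binning scheme, and the assumption that the first-stage score is normalized to [0, 1] are all mine, not the paper's exact format.

```python
def templated_input(query, doc_title, doc_text, retrieval_score, num_bins=10):
    """Verbalize text and a bucketed retrieval feature into one input string."""
    # Assumes retrieval_score has been normalized into [0, 1].
    score_bin = min(int(retrieval_score * num_bins), num_bins - 1)
    return (f"Query: {query} Title: {doc_title} "
            f"Feature: {score_bin} Document: {doc_text} Relevant:")
```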

Model-Generated Pretraining Signals Improves Zero-Shot Generalization of Text-to-Text Transformers

May 21, 2023
Linyuan Gong, Chenyan Xiong, Xiaodong Liu, Payal Bajaj, Yiqing Xie, Alvin Cheung, Jianfeng Gao, Xia Song

This paper explores the effectiveness of model-generated signals in improving the zero-shot generalization of text-to-text Transformers such as T5. We study various designs for pretraining T5 with an auxiliary model that constructs more challenging token replacements for the main model to denoise. Key aspects under study include the decoding target, the location of the replaced-token-detection (RTD) head, and the masking pattern. Based on these studies, we develop a new model, METRO-T0, which is pretrained using the redesigned ELECTRA-style pretraining strategies and then prompt-finetuned on a mixture of NLP tasks. METRO-T0 outperforms all similar-sized baselines on prompted NLP benchmarks, such as T0 Eval and MMLU, and rivals the state-of-the-art T0-11B model with only 8% of its parameters. Our analysis of the model's neural activations and parameter sensitivity reveals that the effectiveness of METRO-T0 stems from a more balanced contribution of parameters and better utilization of their capacity. The code and model checkpoints are available at https://github.com/gonglinyuan/metro_t0.
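
A hedged sketch of how ELECTRA-style model-generated signals are constructed: an auxiliary model fills masked positions, and the main model is trained to detect which tokens were replaced. `auxiliary_fill` stands in for the auxiliary model's sampling step; the function shape is illustrative.

```python
import torch

def build_rtd_example(token_ids: torch.Tensor,
                      mask_positions: torch.Tensor,
                      auxiliary_fill):
    """Corrupt an input with auxiliary-model samples and label replacements."""
    corrupted = token_ids.clone()
    # The auxiliary model proposes plausible tokens at the masked positions.
    corrupted[mask_positions] = auxiliary_fill(token_ids, mask_positions)
    # RTD labels: 1 where the sampled token differs from the original
    # (positions where the auxiliary model happened to guess right stay 0).
    rtd_labels = (corrupted != token_ids).long()
    return corrupted, rtd_labels
```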

* Published as a conference paper at ACL 2023. 9 pages 

Unsupervised Dense Retrieval Training with Web Anchors

May 10, 2023
Yiqing Xie, Xiao Liu, Chenyan Xiong

In this work, we present an unsupervised retrieval method with contrastive learning on web anchors. An anchor text describes the content that is referenced from the linked page, much like a search query that aims to retrieve pertinent information from relevant documents. Based on this commonality, we train an unsupervised dense retriever, Anchor-DR, with a contrastive learning task that matches the anchor text and the linked document. To filter out uninformative anchors (such as "homepage" or other functional anchors), we present a novel filtering technique that selects only anchors containing the same types of information as search queries. Experiments show that Anchor-DR outperforms state-of-the-art methods on unsupervised dense retrieval by a large margin (e.g., by 5.3% NDCG@10 on MS MARCO). The gain of our method is especially significant on search and question answering tasks. Our analysis further reveals that the pattern of anchor-document pairs is similar to that of search query-document pairs. Code is available at https://github.com/Veronicium/AnchorDR.
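
A toy anchor filter in the spirit described above: drop short functional anchors and keep anchors that look like search queries. The stoplist, `query_likeness` classifier, and threshold are invented stand-ins, not the paper's actual technique.

```python
# Invented stoplist of functional anchors for illustration.
FUNCTIONAL_ANCHORS = {
    "home", "homepage", "click here", "read more", "login", "next",
}

def keep_anchor(anchor_text, query_likeness):
    """Keep an anchor only if it is plausibly query-like."""
    text = anchor_text.strip().lower()
    # Drop known functional anchors and very short anchors.
    if text in FUNCTIONAL_ANCHORS or len(text.split()) < 2:
        return False
    # `query_likeness(text)` stands in for a learned classifier scoring how
    # query-like an anchor is; 0.5 is an arbitrary illustrative threshold.
    return query_likeness(text) > 0.5
```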

* SIGIR'23 Short 