Ming Gong

RUEL: Retrieval-Augmented User Representation with Edge Browser Logs for Sequential Recommendation

Sep 19, 2023
Ning Wu, Ming Gong, Linjun Shou, Jian Pei, Daxin Jiang

Online recommender systems (RS) aim to match user needs with the vast amount of resources available on various platforms. A key challenge is to model user preferences accurately under the condition of data sparsity. To address this challenge, some methods have leveraged external user behavior data from multiple platforms to enrich user representation. However, all of these methods require a consistent user ID across platforms and ignore the information from similar users. In this study, we propose RUEL, a novel retrieval-based sequential recommender that can effectively incorporate external anonymous user behavior data from Edge browser logs to enhance recommendation. We first collect and preprocess a large volume of Edge browser logs over a one-year period and link them to target entities that correspond to candidate items in recommendation datasets. We then design a contrastive learning framework with a momentum encoder and a memory bank to retrieve the most relevant and diverse browsing sequences from the full browsing log based on the semantic similarity between user representations. After retrieval, we apply an item-level attentive selector to filter out noisy items and generate refined sequence embeddings for the final predictor. RUEL is the first method that connects user browsing data with typical recommendation datasets and can be generalized to various recommendation scenarios and datasets. We conduct extensive experiments on four real datasets for sequential recommendation tasks and demonstrate that RUEL significantly outperforms state-of-the-art baselines. We also conduct ablation studies and qualitative analysis to validate the effectiveness of each component of RUEL and provide additional insights into our method.

* CIKM 2023 ADS 
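The abstract's retrieval component pairs a momentum encoder with a memory bank, a MoCo-style setup. The PyTorch sketch below illustrates that general pattern over pooled browsing-sequence embeddings; the toy encoders, queue size, momentum, and temperature are illustrative assumptions rather than RUEL's actual architecture or settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MomentumContrast(nn.Module):
    """MoCo-style contrastive learning over browsing-sequence embeddings.

    A minimal sketch: a query encoder is trained with InfoNCE against a
    slowly updated momentum (key) encoder and a FIFO memory bank of
    previously seen sequence embeddings. Hyper-parameters are illustrative.
    """

    def __init__(self, dim=128, queue_size=4096, momentum=0.999, temperature=0.07):
        super().__init__()
        # Toy sequence encoders: a small MLP over pooled item embeddings.
        self.encoder_q = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.encoder_k = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.encoder_k.load_state_dict(self.encoder_q.state_dict())
        for p in self.encoder_k.parameters():
            p.requires_grad = False
        self.m, self.t = momentum, temperature
        self.register_buffer("queue", F.normalize(torch.randn(queue_size, dim), dim=1))
        self.register_buffer("ptr", torch.zeros(1, dtype=torch.long))

    @torch.no_grad()
    def _momentum_update(self):
        for pq, pk in zip(self.encoder_q.parameters(), self.encoder_k.parameters()):
            pk.data.mul_(self.m).add_(pq.data, alpha=1.0 - self.m)

    @torch.no_grad()
    def _enqueue(self, keys):
        n, ptr = keys.size(0), int(self.ptr)
        self.queue[ptr:ptr + n] = keys                     # assumes queue_size % batch == 0
        self.ptr[0] = (ptr + n) % self.queue.size(0)

    def forward(self, seq_q, seq_k):
        """seq_q / seq_k: two views of the same browsing sequence, shape (B, dim)."""
        q = F.normalize(self.encoder_q(seq_q), dim=1)
        with torch.no_grad():
            self._momentum_update()
            k = F.normalize(self.encoder_k(seq_k), dim=1)
        pos = (q * k).sum(dim=1, keepdim=True)             # (B, 1) positive similarities
        neg = q @ self.queue.t()                           # (B, K) memory-bank negatives
        logits = torch.cat([pos, neg], dim=1) / self.t
        labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)  # positive at index 0
        self._enqueue(k)
        return F.cross_entropy(logits, labels)

# Usage: loss = MomentumContrast()(view_a, view_b) on (B, 128) pooled sequence embeddings.
```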

Alleviating Over-smoothing for Unsupervised Sentence Representation

May 09, 2023
Nuo Chen, Linjun Shou, Ming Gong, Jian Pei, Bowen Cao, Jianhui Chang, Daxin Jiang, Jia Li

Learning better unsupervised sentence representations is currently a central pursuit of the natural language processing community. Many approaches based on pre-trained language models (PLMs) and contrastive learning have achieved promising results on this task. Experimentally, we observe that the over-smoothing problem reduces the capacity of these powerful PLMs, leading to sub-optimal sentence representations. In this paper, we present a simple method named Self-Contrastive Learning (SSCL) to alleviate this issue, which samples negatives from the intermediate layers of PLMs to improve the quality of the sentence representations. Our method is simple and can easily be extended to various state-of-the-art models for performance boosting, serving as a plug-and-play contrastive framework for learning unsupervised sentence representations. Extensive results show that SSCL brings substantial performance improvements to different strong baselines (e.g., BERT and SimCSE) on Semantic Textual Similarity and transfer datasets. Our code is available at https://github.com/nuochenpku/SSCL.

* ACL 2023  
* 13 pages 
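SSCL's core idea, sampling negatives from a PLM's intermediate layers, can be illustrated with a short contrastive-loss sketch. The dropout-based positive views and the temperature below are assumptions in the spirit of SimCSE-style training, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

class SelfContrastiveLoss(torch.nn.Module):
    """Sketch of SSCL-style training: an intermediate layer's sentence
    embedding serves as an extra negative for the final-layer embedding,
    on top of the usual in-batch negatives."""

    def __init__(self, temperature=0.05):
        super().__init__()
        self.t = temperature

    def forward(self, final_a, final_b, intermediate_a):
        """final_a, final_b: two dropout views from the last layer, (B, d).
        intermediate_a: the same sentences encoded at an earlier layer, (B, d)."""
        za = F.normalize(final_a, dim=1)
        zb = F.normalize(final_b, dim=1)
        zi = F.normalize(intermediate_a, dim=1)
        sim_pos = za @ zb.t() / self.t                 # (B, B), positives on the diagonal
        sim_neg = za @ zi.t() / self.t                 # (B, B), intermediate-layer negatives
        logits = torch.cat([sim_pos, sim_neg], dim=1)  # (B, 2B)
        labels = torch.arange(za.size(0), device=za.device)
        return F.cross_entropy(logits, labels)

# Usage with any encoder that exposes hidden states:
# loss = SelfContrastiveLoss()(h_last_view1, h_last_view2, h_layer6_view1)
```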

Typos-aware Bottlenecked Pre-Training for Robust Dense Retrieval

Apr 17, 2023
Shengyao Zhuang, Linjun Shou, Jian Pei, Ming Gong, Houxing Ren, Guido Zuccon, Daxin Jiang

Current dense retrievers (DRs) are limited in their ability to effectively process misspelled queries, which constitute a significant portion of query traffic in commercial search engines. The main issue is that the pre-trained language model-based encoders used by DRs are typically trained and fine-tuned on clean, well-curated text data. Misspelled queries are typically absent from this data, so misspelled queries observed at inference time are out-of-distribution compared to the data used for training and fine-tuning. Previous efforts to address this issue have focused on fine-tuning strategies, but their effectiveness on misspelled queries remains lower than that of pipelines that employ separate state-of-the-art spell-checking components. To address this challenge, we propose ToRoDer (TypOs-aware bottlenecked pre-training for RObust DEnse Retrieval), a novel pre-training strategy for DRs that increases their robustness to misspelled queries while preserving their effectiveness on downstream retrieval tasks. ToRoDer uses an encoder-decoder architecture in which the encoder takes misspelled text with masked tokens as input and outputs bottlenecked information to the decoder. The decoder then takes as input the bottlenecked embeddings, along with the token embeddings of the original text in which the misspelled tokens are masked out. The pre-training task is to recover the masked tokens for both the encoder and the decoder. Our extensive experimental results and detailed ablation studies show that DRs pre-trained with ToRoDer exhibit significantly higher effectiveness on misspelled queries, considerably narrowing the gap with pipelines that use a separate, complex spell-checker component, while retaining their effectiveness on correctly spelled queries.

* 10 pages 
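A minimal sketch of the data preparation implied by the abstract: corrupt tokens with synthetic typos, mask some encoder inputs, and mask the corrupted positions on the decoder side. The corruption operations, rates, and [MASK] convention are illustrative assumptions, not the paper's exact pipeline.

```python
import random

def inject_typo(word, rng):
    """Apply one simple character-level edit (insert, delete, substitute, or swap).
    An illustrative stand-in for a typo generator, not the paper's exact corruption model."""
    if len(word) < 2:
        return word
    letters = "abcdefghijklmnopqrstuvwxyz"
    op = rng.choice(["insert", "delete", "substitute", "swap"])
    i = rng.randrange(len(word))
    if op == "insert":
        return word[:i] + rng.choice(letters) + word[i:]
    if op == "delete":
        return word[:i] + word[i + 1:]
    if op == "substitute":
        return word[:i] + rng.choice(letters) + word[i + 1:]
    i = min(i, len(word) - 2)                      # swap two adjacent characters
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

def corrupt_and_mask(tokens, typo_rate=0.2, mask_rate=0.3, seed=0):
    """Build the two pre-training inputs sketched above:
    - encoder input: typo-corrupted text with some tokens masked out
    - decoder input: original text with the corrupted positions masked out
    so the model must recover the masked tokens on both sides. Rates are illustrative."""
    rng = random.Random(seed)
    enc_in, dec_in, targets = [], [], list(tokens)
    for tok in tokens:
        corrupted = inject_typo(tok, rng) if rng.random() < typo_rate else tok
        enc_in.append("[MASK]" if rng.random() < mask_rate else corrupted)
        dec_in.append("[MASK]" if corrupted != tok else tok)
    return enc_in, dec_in, targets

print(corrupt_and_mask("dense retrieval with misspelled queries".split()))
```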

TaskMatrix.AI: Completing Tasks by Connecting Foundation Models with Millions of APIs

Mar 29, 2023
Yaobo Liang, Chenfei Wu, Ting Song, Wenshan Wu, Yan Xia, Yu Liu, Yang Ou, Shuai Lu, Lei Ji, Shaoguang Mao, Yun Wang, Linjun Shou, Ming Gong, Nan Duan

Artificial Intelligence (AI) has made incredible progress recently. On the one hand, advanced foundation models like ChatGPT offer powerful conversation, in-context learning, and code generation abilities on a broad range of open-domain tasks, and they can generate high-level solution outlines for domain-specific tasks based on the common-sense knowledge they have acquired. However, they still struggle with some specialized tasks, either because they lack sufficient domain-specific data during pre-training or because they often make errors in their neural computations on tasks that require accurate execution. On the other hand, there are many existing models and systems (symbolic or neural) that perform certain domain-specific tasks very well. However, due to their different implementations or working mechanisms, they are not easily accessible or compatible with foundation models. There is therefore a clear and pressing need for a mechanism that can leverage foundation models to propose task solution outlines and then automatically match some of the sub-tasks in those outlines to off-the-shelf models and systems with specialized functionality. Inspired by this, we introduce TaskMatrix.AI, a new AI ecosystem that connects foundation models with millions of APIs for task completion. Unlike most previous work that aimed to improve a single AI model, TaskMatrix.AI focuses on using existing foundation models (as a brain-like central system) and the APIs of other AI models and systems (as sub-task solvers) to accomplish diversified tasks in both digital and physical domains. As a position paper, we present our vision of how to build such an ecosystem, explain each key component, use case studies to illustrate the feasibility of this vision, and discuss the main challenges we need to address next.
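As a toy illustration of the matching step described above (routing sub-tasks from a foundation model's outline to registered APIs), the sketch below uses plain string similarity over a hypothetical API registry; a real system would rely on learned retrieval over API documentation.

```python
from difflib import SequenceMatcher

# Hypothetical API registry: names mapped to natural-language descriptions.
API_REGISTRY = {
    "image_caption": "generate a text caption describing an input image",
    "weather_lookup": "look up the current weather forecast for a given city",
    "send_email": "send an email with a subject and body to a recipient",
}

def match_subtask_to_api(subtask: str) -> str:
    """Route one sub-task from a foundation model's solution outline to the
    registered API whose description it most resembles. String similarity
    only illustrates the matching step."""
    return max(
        API_REGISTRY,
        key=lambda name: SequenceMatcher(None, subtask.lower(), API_REGISTRY[name]).ratio(),
    )

# Each step of a (hypothetical) outline is paired with the API chosen to execute it.
outline = [
    "look up the current weather for Seattle",
    "send an email with the forecast to the team",
]
print([(step, match_subtask_to_api(step)) for step in outline])
```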


Large Language Models are Diverse Role-Players for Summarization Evaluation

Mar 28, 2023
Ning Wu, Ming Gong, Linjun Shou, Shining Liang, Daxin Jiang

Text summarization has a wide range of applications in many scenarios, but evaluating the quality of the generated text is a complex problem. A major challenge in language evaluation is the clear divergence between existing metrics and human judgments. For example, the quality of a document summary can be assessed by human annotators along objective aspects, such as grammatical and semantic correctness, as well as subjective dimensions, such as comprehensiveness, succinctness, and interestingness. Most automatic evaluation methods, like BLEU/ROUGE, may not capture these dimensions well. In this paper, we propose a new LLM-based evaluation framework that compares generated text and reference text along both objective and subjective aspects. First, we model the objective and subjective dimensions of generated text with a role-player prompting mechanism. Furthermore, we introduce a context-based prompting mechanism that can generate dynamic role-player profiles based on the input context. Finally, we design a multi-role-player prompting technique based on batch prompting that integrates multiple evaluation results into a final score. Experimental results on two real summarization datasets show that our model is highly competitive and exhibits very high consistency with human annotators.
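A rough sketch of what role-player prompting for summary evaluation might look like: one prompt per evaluation dimension, each answered by an LLM and then aggregated. The role descriptions, scoring scale, and the call_llm placeholder are all hypothetical, not the paper's prompts.

```python
from statistics import mean
from typing import Callable

# Hypothetical role-players, one per evaluation dimension.
ROLES = {
    "grammar reviewer": "Judge grammatical and semantic correctness.",
    "busy executive": "Judge succinctness: is every sentence necessary?",
    "curious reader": "Judge comprehensiveness and interestingness.",
}

def build_prompt(role: str, instruction: str, reference: str, summary: str) -> str:
    return (
        f"You are a {role}. {instruction}\n"
        f"Reference text:\n{reference}\n\n"
        f"Candidate summary:\n{summary}\n\n"
        "Rate the summary from 1 (poor) to 5 (excellent). Reply with a single number."
    )

def evaluate_summary(call_llm: Callable[[str], str], reference: str, summary: str) -> float:
    """Query one role-player per dimension and average the returned scores.
    `call_llm` is a placeholder for whatever chat-completion client is used."""
    scores = []
    for role, instruction in ROLES.items():
        reply = call_llm(build_prompt(role, instruction, reference, summary))
        scores.append(float(reply.strip()))
    return mean(scores)

# Example with a stubbed model that always answers "4":
print(evaluate_summary(lambda prompt: "4", "full document ...", "short summary ..."))
```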


Empowering Dual-Encoder with Query Generator for Cross-Lingual Dense Retrieval

Mar 27, 2023
Houxing Ren, Linjun Shou, Ning Wu, Ming Gong, Daxin Jiang

In monolingual dense retrieval, many works focus on distilling knowledge from a cross-encoder re-ranker into a dual-encoder retriever, and these methods achieve better performance thanks to the effectiveness of the cross-encoder re-ranker. However, we find that the performance of the cross-encoder re-ranker is heavily influenced by the number of training samples and the quality of negative samples, both of which are hard to obtain in the cross-lingual setting. In this paper, we propose to use a query generator as the teacher in the cross-lingual setting, which is less dependent on abundant training samples and high-quality negative samples. In addition to traditional knowledge distillation, we further propose a novel enhancement method that uses the query generator to help the dual-encoder align queries from different languages, without requiring any additional parallel sentences. Experimental results show that our method outperforms state-of-the-art methods on two benchmark datasets.

* EMNLP 2022 main conference 
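The teacher signal described above can be sketched as a distillation loss in which the query generator's log-likelihood of the query given a passage supervises the dual-encoder's scores. The KL formulation, dot-product scoring, and temperature below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def qg_distillation_loss(student_q, student_p, teacher_loglik, temperature=1.0):
    """KL-distillation sketch: the teacher signal is the log-likelihood a query
    generator assigns to the query given each candidate passage, and the student
    is a dual-encoder whose score is a dot product. Shapes:
      student_q:      (B, d)    query embeddings
      student_p:      (B, N, d) candidate passage embeddings per query
      teacher_loglik: (B, N)    log P_QG(query | passage) for the same candidates
    """
    student_scores = torch.einsum("bd,bnd->bn", student_q, student_p)
    teacher_dist = F.softmax(teacher_loglik / temperature, dim=-1)
    student_logprob = F.log_softmax(student_scores / temperature, dim=-1)
    return F.kl_div(student_logprob, teacher_dist, reduction="batchmean")

# Toy check with random tensors (B=2 queries, N=4 candidates, d=16):
q, p, t = torch.randn(2, 16), torch.randn(2, 4, 16), torch.randn(2, 4)
print(qg_distillation_loss(q, p, t))
```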

Lexicon-Enhanced Self-Supervised Training for Multilingual Dense Retrieval

Mar 27, 2023
Houxing Ren, Linjun Shou, Jian Pei, Ning Wu, Ming Gong, Daxin Jiang

Recent multilingual pre-trained models have shown strong performance on various multilingual tasks. However, these models perform poorly on multilingual retrieval tasks due to a lack of multilingual retrieval training data. In this paper, we propose to mine and generate self-supervised training data from a large-scale unlabeled corpus. We carefully design a mining method that combines sparse and dense models to estimate the relevance of unlabeled queries and passages, and we introduce a query generator to produce additional queries in target languages for unlabeled passages. Through extensive experiments on the Mr. TYDI dataset and an industrial dataset from a commercial search engine, we demonstrate that our method outperforms baselines based on various pre-trained multilingual models, and it even achieves performance on par with the supervised method on the latter dataset.

* EMNLP 2022 Findings 
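A minimal sketch of the sparse-plus-dense mining step: fuse the two retrievers' scores and keep the top-ranked passages as pseudo-positive pairs. The min-max normalization, convex fusion weight, and top-k selection are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

def mine_pseudo_pairs(sparse_scores, dense_scores, k=1, sparse_weight=0.5):
    """For each unlabeled query, fuse a lexical score (e.g. BM25) and a dense-retriever
    score with a simple convex combination, then keep the top-k passages as
    pseudo-positive training pairs. Inputs are (num_queries, num_passages) matrices."""
    def minmax(x):
        lo, hi = x.min(axis=1, keepdims=True), x.max(axis=1, keepdims=True)
        return (x - lo) / np.maximum(hi - lo, 1e-9)
    fused = sparse_weight * minmax(sparse_scores) + (1 - sparse_weight) * minmax(dense_scores)
    topk = np.argsort(-fused, axis=1)[:, :k]
    return [(qi, int(pi)) for qi in range(fused.shape[0]) for pi in topk[qi]]

# Toy example: 2 queries scored against 5 passages by both retrievers.
sparse, dense = np.random.rand(2, 5), np.random.rand(2, 5)
print(mine_pseudo_pairs(sparse, dense))
```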

Bridge the Gap between Language models and Tabular Understanding

Feb 16, 2023
Nuo Chen, Linjun Shou, Ming Gong, Jian Pei, Chenyu You, Jianhui Chang, Daxin Jiang, Jia Li

The table pretrain-then-finetune paradigm has been adopted at a rapid pace following the success of pre-training in the natural language domain. Despite the promising findings on tabular pre-trained language models (TPLMs), there is an input gap between the pre-training and fine-tuning phases. For instance, TPLMs jointly pre-trained on table and text input can be effective for tasks with joint table-text input, such as table question answering, but may fail for tasks with only tables or only text as input, such as table retrieval. To this end, we propose UTP, an approach that dynamically supports three types of multi-modal input: table-text, table, and text. Specifically, UTP is pre-trained with two strategies: (1) we first apply a universal masked language modeling objective to each kind of input, forcing the model to adapt to various inputs; (2) we then present Cross-Modal Contrastive Regularization (CMCR), which uses contrastive learning to encourage consistency between table and text cross-modal representations via unsupervised instance-wise training signals during pre-training. In this way, the resulting model not only bridges the input gap between pre-training and fine-tuning but also improves the alignment of tables and text. Extensive results show that UTP achieves superior results on uni-modal input tasks (e.g., table retrieval) and cross-modal input tasks (e.g., table question answering).

* 7 pages 
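The Cross-Modal Contrastive Regularization idea can be sketched as a symmetric InfoNCE loss between pooled table and text representations of the same instance, with in-batch negatives. The symmetric formulation and temperature are assumptions, not necessarily the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(table_emb, text_emb, temperature=0.05):
    """Pull the table and text representations of the same instance together and
    push apart other in-batch pairs (symmetric InfoNCE)."""
    t = F.normalize(table_emb, dim=1)          # (B, d) pooled table encoding
    x = F.normalize(text_emb, dim=1)           # (B, d) pooled text encoding
    logits = t @ x.t() / temperature           # (B, B); matched pairs on the diagonal
    labels = torch.arange(t.size(0), device=t.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

# Toy usage on random pooled embeddings:
print(cross_modal_contrastive_loss(torch.randn(8, 64), torch.randn(8, 64)))
```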

Modeling Sequential Sentence Relation to Improve Cross-lingual Dense Retrieval

Feb 03, 2023
Shunyu Zhang, Yaobo Liang, Ming Gong, Daxin Jiang, Nan Duan

Recently, multilingual pre-trained language models (PLMs) such as mBERT and XLM-R have made impressive strides in cross-lingual dense retrieval. Despite these successes, they are general-purpose PLMs, and multilingual PLMs tailored for cross-lingual retrieval remain unexplored. Motivated by the observation that sentences in parallel documents appear in approximately the same order, which holds universally across languages, we propose to model this sequential sentence relation to facilitate cross-lingual representation learning. Specifically, we propose a multilingual PLM called the masked sentence model (MSM), which consists of a sentence encoder that generates sentence representations and a document encoder applied to the sequence of sentence vectors from a document. The document encoder is shared across all languages to model the universal sequential sentence relation. To train the model, we propose a masked sentence prediction task, which masks a sentence vector and predicts it via a hierarchical contrastive loss with sampled negatives. Comprehensive experiments on four cross-lingual retrieval tasks show that MSM significantly outperforms existing advanced pre-training models, demonstrating the effectiveness and stronger cross-lingual retrieval capabilities of our approach. Code and models will be made available.

* Published at ICLR 2023 
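A compact sketch of the masked-sentence-prediction idea: run a shared document encoder over sentence vectors with one position replaced by a learned mask vector, then score the prediction against the original sentence vector and sampled negatives. The single-level InfoNCE below simplifies the paper's hierarchical contrastive loss; sizes and the transformer document encoder are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedSentenceModel(nn.Module):
    """Sketch: a shared document encoder runs over a sequence of sentence vectors
    with one position replaced by a learned [MASK] vector, and the output at that
    position must identify the original sentence vector against sampled negatives."""

    def __init__(self, dim=128, heads=4, layers=2, temperature=0.05):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.doc_encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.mask_vec = nn.Parameter(torch.randn(dim))
        self.t = temperature

    def forward(self, sent_vecs, mask_pos, negatives):
        """sent_vecs: (B, L, d) sentence embeddings of one document per row.
        mask_pos: (B,) index of the masked sentence. negatives: (B, K, d)."""
        x = sent_vecs.clone()
        batch = torch.arange(x.size(0))
        target = x[batch, mask_pos]                        # (B, d) original sentence vectors
        x[batch, mask_pos] = self.mask_vec                 # replace with learned [MASK] vector
        pred = self.doc_encoder(x)[batch, mask_pos]        # contextual prediction at mask position
        pred, target = F.normalize(pred, dim=1), F.normalize(target, dim=1)
        negatives = F.normalize(negatives, dim=2)
        pos = (pred * target).sum(dim=1, keepdim=True)           # (B, 1)
        neg = torch.einsum("bd,bkd->bk", pred, negatives)        # (B, K)
        logits = torch.cat([pos, neg], dim=1) / self.t
        labels = torch.zeros(x.size(0), dtype=torch.long, device=pred.device)
        return F.cross_entropy(logits, labels)

# Toy usage: 4 documents of 6 sentences each, 16 sampled negatives per document.
model = MaskedSentenceModel()
print(model(torch.randn(4, 6, 128), torch.tensor([1, 3, 0, 5]), torch.randn(4, 16, 128)))
```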