Alert button
Picture for Junwei Pan

Junwei Pan

Alert button

Decoupled Training: Return of Frustratingly Easy Multi-Domain Learning

Sep 19, 2023
Ximei Wang, Junwei Pan, Xingzhuo Guo, Dapeng Liu, Jie Jiang

Multi-domain learning (MDL) aims to train a model with minimal average risk across multiple overlapping but non-identical domains. To tackle the challenges of dataset bias and domain domination, numerous MDL approaches have been proposed from the perspectives of seeking commonalities by aligning distributions to reduce domain gap or reserving differences by implementing domain-specific towers, gates, and even experts. MDL models are becoming more and more complex with sophisticated network architectures or loss functions, introducing extra parameters and enlarging computation costs. In this paper, we propose a frustratingly easy and hyperparameter-free multi-domain learning method named Decoupled Training(D-Train). D-Train is a tri-phase general-to-specific training strategy that first pre-trains on all domains to warm up a root model, then post-trains on each domain by splitting into multi heads, and finally fine-tunes the heads by fixing the backbone, enabling decouple training to achieve domain independence. Despite its extraordinary simplicity and efficiency, D-Train performs remarkably well in extensive evaluations of various datasets from standard benchmarks to applications of satellite imagery and recommender systems.

Viaarxiv icon

STEM: Unleashing the Power of Embeddings for Multi-task Recommendation

Aug 16, 2023
Liangcai Su, Junwei Pan, Ximei Wang, Xi Xiao, Shijie Quan, Xihua Chen, Jie Jiang

Figure 1 for STEM: Unleashing the Power of Embeddings for Multi-task Recommendation
Figure 2 for STEM: Unleashing the Power of Embeddings for Multi-task Recommendation
Figure 3 for STEM: Unleashing the Power of Embeddings for Multi-task Recommendation
Figure 4 for STEM: Unleashing the Power of Embeddings for Multi-task Recommendation

Multi-task learning (MTL) has gained significant popularity in recommendation systems as it enables the simultaneous optimization of multiple objectives. A key challenge in MTL is the occurrence of negative transfer, where the performance of certain tasks deteriorates due to conflicts between tasks. Existing research has explored negative transfer by treating all samples as a whole, overlooking the inherent complexities within them. To this end, we delve into the intricacies of samples by splitting them based on the relative amount of positive feedback among tasks. Surprisingly, negative transfer still occurs in existing MTL methods on samples that receive comparable feedback across tasks. It is worth noting that existing methods commonly employ a shared-embedding paradigm, and we hypothesize that their failure can be attributed to the limited capacity of modeling diverse user preferences across tasks using such universal embeddings. In this paper, we introduce a novel paradigm called Shared and Task-specific EMbeddings (STEM) that aims to incorporate both shared and task-specific embeddings to effectively capture task-specific user preferences. Under this paradigm, we propose a simple model STEM-Net, which is equipped with shared and task-specific embedding tables, along with a customized gating network with stop-gradient operations to facilitate the learning of these embeddings. Remarkably, STEM-Net demonstrates exceptional performance on comparable samples, surpassing the Single-Task Like model and achieves positive transfer. Comprehensive evaluation on three public MTL recommendation datasets demonstrates that STEM-Net outperforms state-of-the-art models by a substantial margin, providing evidence of its effectiveness and superiority.

Viaarxiv icon

Temporal Interest Network for Click-Through Rate Prediction

Aug 15, 2023
Haolin Zhou, Junwei Pan, Xinyi Zhou, Xihua Chen, Jie Jiang, Xiaofeng Gao, Guihai Chen

The history of user behaviors constitutes one of the most significant characteristics in predicting the click-through rate (CTR), owing to their strong semantic and temporal correlation with the target item. While the literature has individually examined each of these correlations, research has yet to analyze them in combination, that is, the quadruple correlation of (behavior semantics, target semantics, behavior temporal, and target temporal). The effect of this correlation on performance and the extent to which existing methods learn it remain unknown. To address this gap, we empirically measure the quadruple correlation and observe intuitive yet robust quadruple patterns. We measure the learned correlation of several representative user behavior methods, but to our surprise, none of them learn such a pattern, especially the temporal one. In this paper, we propose the Temporal Interest Network (TIN) to capture the quadruple semantic and temporal correlation between behaviors and the target. We achieve this by incorporating target-aware temporal encoding, in addition to semantic embedding, to represent behaviors and the target. Furthermore, we deploy target-aware attention, along with target-aware representation, to explicitly conduct the 4-way interaction. We performed comprehensive evaluations on the Amazon and Alibaba datasets. Our proposed TIN outperforms the best-performing baselines by 0.43\% and 0.29\% on two datasets, respectively. Comprehensive analysis and visualization show that TIN is indeed capable of learning the quadruple correlation effectively, while all existing methods fail to do so. We provide our implementation of TIN in Tensorflow.

Viaarxiv icon

ForkMerge: Overcoming Negative Transfer in Multi-Task Learning

Jan 30, 2023
Junguang Jiang, Baixu Chen, Junwei Pan, Ximei Wang, Liu Dapeng, Jie Jiang, Mingsheng Long

Figure 1 for ForkMerge: Overcoming Negative Transfer in Multi-Task Learning
Figure 2 for ForkMerge: Overcoming Negative Transfer in Multi-Task Learning
Figure 3 for ForkMerge: Overcoming Negative Transfer in Multi-Task Learning
Figure 4 for ForkMerge: Overcoming Negative Transfer in Multi-Task Learning

The goal of multi-task learning is to utilize useful knowledge from multiple related tasks to improve the generalization performance of all tasks. However, learning multiple tasks simultaneously often results in worse performance than learning them independently, which is known as negative transfer. Most previous works attribute negative transfer in multi-task learning to gradient conflicts between different tasks and propose several heuristics to manipulate the task gradients for mitigating this problem, which mainly considers the optimization difficulty and overlooks the generalization problem. To fully understand the root cause of negative transfer, we experimentally analyze negative transfer from the perspectives of optimization, generalization, and hypothesis space. Stemming from our analysis, we introduce ForkMerge, which periodically forks the model into multiple branches with different task weights, and merges dynamically to filter out detrimental parameter updates to avoid negative transfer. On a series of multi-task learning tasks, ForkMerge achieves improved performance over state-of-the-art methods and largely avoids negative transfer.

Viaarxiv icon

AdaTask: A Task-aware Adaptive Learning Rate Approach to Multi-task Learning

Nov 28, 2022
Enneng Yang, Junwei Pan, Ximei Wang, Haibin Yu, Li Shen, Xihua Chen, Lei Xiao, Jie Jiang, Guibing Guo

Figure 1 for AdaTask: A Task-aware Adaptive Learning Rate Approach to Multi-task Learning
Figure 2 for AdaTask: A Task-aware Adaptive Learning Rate Approach to Multi-task Learning
Figure 3 for AdaTask: A Task-aware Adaptive Learning Rate Approach to Multi-task Learning
Figure 4 for AdaTask: A Task-aware Adaptive Learning Rate Approach to Multi-task Learning

Multi-task learning (MTL) models have demonstrated impressive results in computer vision, natural language processing, and recommender systems. Even though many approaches have been proposed, how well these approaches balance different tasks on each parameter still remains unclear. In this paper, we propose to measure the task dominance degree of a parameter by the total updates of each task on this parameter. Specifically, we compute the total updates by the exponentially decaying Average of the squared Updates (AU) on a parameter from the corresponding task.Based on this novel metric, we observe that many parameters in existing MTL methods, especially those in the higher shared layers, are still dominated by one or several tasks. The dominance of AU is mainly due to the dominance of accumulative gradients from one or several tasks. Motivated by this, we propose a Task-wise Adaptive learning rate approach, AdaTask in short, to separate the \emph{accumulative gradients} and hence the learning rate of each task for each parameter in adaptive learning rate approaches (e.g., AdaGrad, RMSProp, and Adam). Comprehensive experiments on computer vision and recommender system MTL datasets demonstrate that AdaTask significantly improves the performance of dominated tasks, resulting SOTA average task-wise performance. Analysis on both synthetic and real-world datasets shows AdaTask balance parameters in every shared layer well.

* AAAI 2023 
Viaarxiv icon

AutoAttention: Automatic Field Pair Selection for Attention in User Behavior Modeling

Oct 27, 2022
Zuowu Zheng, Xiaofeng Gao, Junwei Pan, Qi Luo, Guihai Chen, Dapeng Liu, Jie Jiang

Figure 1 for AutoAttention: Automatic Field Pair Selection for Attention in User Behavior Modeling
Figure 2 for AutoAttention: Automatic Field Pair Selection for Attention in User Behavior Modeling
Figure 3 for AutoAttention: Automatic Field Pair Selection for Attention in User Behavior Modeling
Figure 4 for AutoAttention: Automatic Field Pair Selection for Attention in User Behavior Modeling

In Click-through rate (CTR) prediction models, a user's interest is usually represented as a fixed-length vector based on her history behaviors. Recently, several methods are proposed to learn an attentive weight for each user behavior and conduct weighted sum pooling. However, these methods only manually select several fields from the target item side as the query to interact with the behaviors, neglecting the other target item fields, as well as user and context fields. Directly including all these fields in the attention may introduce noise and deteriorate the performance. In this paper, we propose a novel model named AutoAttention, which includes all item/user/context side fields as the query, and assigns a learnable weight for each field pair between behavior fields and query fields. Pruning on these field pairs via these learnable weights lead to automatic field pair selection, so as to identify and remove noisy field pairs. Though including more fields, the computation cost of AutoAttention is still low due to using a simple attention function and field pair selection. Extensive experiments on the public dataset and Tencent's production dataset demonstrate the effectiveness of the proposed approach.

* Accepted by ICDM 2022 
Viaarxiv icon

Trading Hard Negatives and True Negatives: A Debiased Contrastive Collaborative Filtering Approach

Apr 25, 2022
Chenxiao Yang, Qitian Wu, Jipeng Jin, Xiaofeng Gao, Junwei Pan, Guihai Chen

Figure 1 for Trading Hard Negatives and True Negatives: A Debiased Contrastive Collaborative Filtering Approach
Figure 2 for Trading Hard Negatives and True Negatives: A Debiased Contrastive Collaborative Filtering Approach
Figure 3 for Trading Hard Negatives and True Negatives: A Debiased Contrastive Collaborative Filtering Approach
Figure 4 for Trading Hard Negatives and True Negatives: A Debiased Contrastive Collaborative Filtering Approach

Collaborative filtering (CF), as a standard method for recommendation with implicit feedback, tackles a semi-supervised learning problem where most interaction data are unobserved. Such a nature makes existing approaches highly rely on mining negatives for providing correct training signals. However, mining proper negatives is not a free lunch, encountering with a tricky trade-off between mining informative hard negatives and avoiding false ones. We devise a new approach named as Hardness-Aware Debiased Contrastive Collaborative Filtering (HDCCF) to resolve the dilemma. It could sufficiently explore hard negatives from two-fold aspects: 1) adaptively sharpening the gradients of harder instances through a set-wise objective, and 2) implicitly leveraging item/user frequency information with a new sampling strategy. To circumvent false negatives, we develop a principled approach to improve the reliability of negative instances and prove that the objective is an unbiased estimation of sampling from the true negative distribution. Extensive experiments demonstrate the superiority of the proposed model over existing CF models and hard negative mining methods.

* in IJCAI 2022 
Viaarxiv icon

Impression Allocation and Policy Search in Display Advertising

Mar 11, 2022
Di Wu, Cheng Chen, Xiujun Chen, Junwei Pan, Xun Yang, Qing Tan, Jian Xu, Kuang-Chih Lee

Figure 1 for Impression Allocation and Policy Search in Display Advertising
Figure 2 for Impression Allocation and Policy Search in Display Advertising
Figure 3 for Impression Allocation and Policy Search in Display Advertising
Figure 4 for Impression Allocation and Policy Search in Display Advertising

In online display advertising, guaranteed contracts and real-time bidding (RTB) are two major ways to sell impressions for a publisher. For large publishers, simultaneously selling impressions through both guaranteed contracts and in-house RTB has become a popular choice. Generally speaking, a publisher needs to derive an impression allocation strategy between guaranteed contracts and RTB to maximize its overall outcome (e.g., revenue and/or impression quality). However, deriving the optimal strategy is not a trivial task, e.g., the strategy should encourage incentive compatibility in RTB and tackle common challenges in real-world applications such as unstable traffic patterns (e.g., impression volume and bid landscape changing). In this paper, we formulate impression allocation as an auction problem where each guaranteed contract submits virtual bids for individual impressions. With this formulation, we derive the optimal bidding functions for the guaranteed contracts, which result in the optimal impression allocation. In order to address the unstable traffic pattern challenge and achieve the optimal overall outcome, we propose a multi-agent reinforcement learning method to adjust the bids from each guaranteed contract, which is simple, converging efficiently and scalable. The experiments conducted on real-world datasets demonstrate the effectiveness of our method.

Viaarxiv icon

Cross-Task Knowledge Distillation in Multi-Task Recommendation

Feb 20, 2022
Chenxiao Yang, Junwei Pan, Xiaofeng Gao, Tingyu Jiang, Dapeng Liu, Guihai Chen

Figure 1 for Cross-Task Knowledge Distillation in Multi-Task Recommendation
Figure 2 for Cross-Task Knowledge Distillation in Multi-Task Recommendation
Figure 3 for Cross-Task Knowledge Distillation in Multi-Task Recommendation
Figure 4 for Cross-Task Knowledge Distillation in Multi-Task Recommendation

Multi-task learning has been widely used in real-world recommenders to predict different types of user feedback. Most prior works focus on designing network architectures for bottom layers as a means to share the knowledge about input features representations. However, since they adopt task-specific binary labels as supervised signals for training, the knowledge about how to accurately rank items is not fully shared across tasks. In this paper, we aim to enhance knowledge transfer for multi-task personalized recommendat optimization objectives. We propose a Cross-Task Knowledge Distillation (CrossDistil) framework in recommendation, which consists of three procedures. 1) Task Augmentation: We introduce auxiliary tasks with quadruplet loss functions to capture cross-task fine-grained ranking information, which could avoid task conflicts by preserving the cross-task consistent knowledge; 2) Knowledge Distillation: We design a knowledge distillation approach based on augmented tasks for sharing ranking knowledge, where tasks' predictions are aligned with a calibration process; 3) Model Training: Teacher and student models are trained in an end-to-end manner, with a novel error correction mechanism to speed up model training and improve knowledge quality. Comprehensive experiments on a public dataset and our production dataset are carried out to verify the effectiveness of CrossDistil as well as the necessity of its key components.

* AAAI 2022 
Viaarxiv icon

Follow the Prophet: Accurate Online Conversion Rate Prediction in the Face of Delayed Feedback

Aug 13, 2021
Haoming Li, Feiyang Pan, Xiang Ao, Zhao Yang, Min Lu, Junwei Pan, Dapeng Liu, Lei Xiao, Qing He

Figure 1 for Follow the Prophet: Accurate Online Conversion Rate Prediction in the Face of Delayed Feedback
Figure 2 for Follow the Prophet: Accurate Online Conversion Rate Prediction in the Face of Delayed Feedback

The delayed feedback problem is one of the imperative challenges in online advertising, which is caused by the highly diversified feedback delay of a conversion varying from a few minutes to several days. It is hard to design an appropriate online learning system under these non-identical delay for different types of ads and users. In this paper, we propose to tackle the delayed feedback problem in online advertising by "Following the Prophet" (FTP for short). The key insight is that, if the feedback came instantly for all the logged samples, we could get a model without delayed feedback, namely the "prophet". Although the prophet cannot be obtained during online learning, we show that we could predict the prophet's predictions by an aggregation policy on top of a set of multi-task predictions, where each task captures the feedback patterns of different periods. We propose the objective and optimization approach for the policy, and use the logged data to imitate the prophet. Extensive experiments on three real-world advertising datasets show that our method outperforms the previous state-of-the-art baselines.

* In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '21), July 11--15, 2021, Virtual Event, Canada. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3404835.3463045 
Viaarxiv icon