Multi-behavior recommendation, which exploits auxiliary behaviors (e.g., click and cart) to help predict users' potential interactions on the target behavior (e.g., buy), is regarded as an effective way to alleviate the data sparsity or cold-start issues in recommendation. In real-world applications, these behaviors often occur in a fixed order (e.g., click > cart > buy). In such a behavior chain, a later behavior usually signals a stronger user preference than an earlier one. Most existing multi-behavior models fail to capture these dependencies in a behavior chain for embedding learning. In this work, we propose a novel multi-behavior recommendation model with cascading graph convolution networks (named MB-CGCN). In MB-CGCN, the embeddings learned from one behavior are used as the input features for the next behavior's embedding learning after a feature transformation operation. In this way, our model explicitly utilizes the behavior dependencies in embedding learning. Experiments on two benchmark datasets demonstrate the effectiveness of our model in exploiting multi-behavior data. It outperforms the best baseline by 33.7% and 35.9% on average over the two datasets in terms of Recall@10 and NDCG@10, respectively.
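The cascading idea described above can be illustrated with a minimal sketch. It assumes one mean-aggregation propagation step per behavior graph and a plain linear map as the feature transformation; the function names, the single propagation step, and the toy graph are illustrative assumptions, not the MB-CGCN implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def row_normalize(adj):
    """Divide each row by its degree so propagation averages over neighbors."""
    out = []
    for row in adj:
        deg = sum(row)
        out.append([v / deg if deg else 0.0 for v in row])
    return out


def cascade(adjs, emb, transforms):
    """Process behaviors in chain order; return one embedding matrix per behavior.

    adjs: one adjacency matrix per behavior over the joint user+item node set.
    transforms: hypothetical linear maps applied between consecutive behaviors.
    """
    per_behavior = []
    for i, adj in enumerate(adjs):
        # One propagation step on this behavior's graph (assumed depth).
        emb = matmul(row_normalize(adj), emb)
        per_behavior.append(emb)
        if i < len(transforms):
            # Feature transformation before feeding the next behavior.
            emb = matmul(emb, transforms[i])
    return per_behavior


# Toy data: node 0 is a user, nodes 1 and 2 are items.
click = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]   # user clicked both items
buy = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]     # user bought item 1 only
init = [[1.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
click_emb, buy_emb = cascade([click, buy], init, [identity])
```

The point of the sketch is only the data flow: the buy-stage embeddings are computed from the transformed click-stage embeddings rather than from the initial ones.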
The Diffusion Probabilistic Model (DPM) has emerged as a highly effective generative model in the field of computer vision. Its intermediate latent vectors offer rich semantic information, making it an attractive option for various downstream tasks such as segmentation and detection. In order to explore its potential further, we have taken a step forward and considered a more complex scenario in the medical image domain, specifically, under an unsupervised adaptation condition. To this end, we propose a Diffusion-based and Prototype-guided network (DP-Net) for unsupervised domain adaptive segmentation. Concretely, our DP-Net consists of two stages: 1) Distribution Aligned Diffusion (DADiff), which involves training a domain discriminator to minimize the difference between the intermediate features generated by the DPM, thereby aligning the inter-domain distribution; and 2) Prototype-guided Consistency Learning (PCL), which utilizes feature centroids as prototypes and applies a prototype-guided loss to ensure that the segmentor learns consistent content from both source and target domains. Our approach is evaluated on fundus datasets through a series of experiments, which demonstrate that the performance of the proposed method is reliable and outperforms state-of-the-art methods. Our work presents a promising direction for using DPM in complex medical image scenarios, opening up new possibilities for further research in medical imaging.
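As a rough illustration of using feature centroids as prototypes, the sketch below averages the feature vectors of each class. The function name and the toy features are assumptions; the prototype-guided loss itself is not shown because the abstract does not specify its form.

```python
def class_prototypes(features, labels):
    """Return {label: centroid}, the mean feature vector of each class."""
    sums, counts = {}, {}
    for feat, lab in zip(features, labels):
        if lab not in sums:
            sums[lab] = [0.0] * len(feat)
            counts[lab] = 0
        sums[lab] = [s + f for s, f in zip(sums[lab], feat)]
        counts[lab] += 1
    return {lab: [s / counts[lab] for s in total] for lab, total in sums.items()}


# Toy per-pixel features with their class labels (hypothetical values).
feats = [[1.0, 3.0], [3.0, 5.0], [0.0, 2.0]]
labs = [0, 0, 1]
protos = class_prototypes(feats, labs)
```

In a domain adaptation setting, such centroids could be computed separately on source and target features and then compared class by class.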
In this paper, we present a Neural Preset technique to address the limitations of existing color style transfer methods, including visual artifacts, large memory requirements, and slow style switching. Our method is based on two core designs. First, we propose Deterministic Neural Color Mapping (DNCM) to consistently operate on each pixel via an image-adaptive color mapping matrix, avoiding artifacts and supporting high-resolution inputs with a small memory footprint. Second, we develop a two-stage pipeline by dividing the task into color normalization and stylization, which allows efficient style switching by extracting color styles as presets and reusing them on normalized input images. Due to the unavailability of pairwise datasets, we describe how to train Neural Preset via a self-supervised strategy. Various advantages of Neural Preset over existing methods are demonstrated through comprehensive evaluations. Notably, Neural Preset enables stable 4K color style transfer in real time without artifacts. In addition, we show that our trained model can naturally support multiple applications without fine-tuning, including low-light image enhancement, underwater image correction, image dehazing, and image harmonization. Project page with demos: https://zhkkke.github.io/NeuralPreset
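To make the per-pixel color mapping concrete, here is a minimal sketch that applies one matrix to every pixel independently. It assumes a plain 3x3 matrix supplied by the caller; in the described method the matrix is image-adaptive, and how it is predicted is not shown here. The function name is hypothetical.

```python
def apply_color_matrix(pixels, matrix):
    """Map every RGB pixel through the same 3x3 matrix, one pixel at a time."""
    return [
        tuple(sum(m * c for m, c in zip(row, px)) for row in matrix)
        for px in pixels
    ]


# Toy example: a matrix that swaps the red and blue channels.
swap_rb = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]
image = [(0.2, 0.5, 0.9), (1.0, 0.0, 0.0)]
mapped = apply_color_matrix(image, swap_rb)
```

Because each output pixel depends only on its own input pixel and the shared matrix, the extra memory for the mapping does not grow with resolution, and neighboring pixels cannot introduce spatial artifacts.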
When capturing and storing images, devices inevitably introduce noise. Reducing this noise is a critical task called image denoising. Deep learning has become the de facto method for image denoising, especially with the emergence of Transformer-based models that have achieved notable state-of-the-art results on various image tasks. However, deep learning-based methods often suffer from a lack of generalization ability. For example, deep models trained on Gaussian noise may perform poorly when tested on other noise distributions. To address this issue, we present a novel approach to enhance the generalization performance of denoising networks, known as masked training. Our method involves masking random pixels of the input image and reconstructing the missing information during training. We also mask out the features in the self-attention layers to avoid the impact of training-testing inconsistency. Our approach exhibits better generalization ability than other deep learning models and is directly applicable to real-world scenarios. Additionally, our interpretability analysis demonstrates the superiority of our method.
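The input-masking step can be sketched as follows. The masking ratio, the fill value, and the function name are assumptions made for illustration; the abstract does not state them, and the attention-layer masking is not covered.

```python
import random


def mask_pixels(image, ratio, rng, fill=0.0):
    """Return (masked_copy, mask) with a `ratio` fraction of pixels set to `fill`.

    image: 2D list of pixel values; rng: a random.Random instance.
    """
    h, w = len(image), len(image[0])
    n_masked = int(ratio * h * w)
    # Choose distinct flat pixel indices to hide.
    chosen = set(rng.sample(range(h * w), n_masked))
    mask = [[(r * w + c) in chosen for c in range(w)] for r in range(h)]
    masked = [
        [fill if mask[r][c] else image[r][c] for c in range(w)]
        for r in range(h)
    ]
    return masked, mask


# Toy example: hide half of a 4x4 image; the network would be trained
# to reconstruct the original from `masked`.
img = [[1.0] * 4 for _ in range(4)]
masked, mask = mask_pixels(img, 0.5, random.Random(0))
```

During training, the masked image would be the network input and the clean image the target, so the model has to infer hidden pixels from their context instead of memorizing one noise pattern.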
Diffusion Probabilistic Models have recently shown remarkable performance in generative image modeling, attracting significant attention in the computer vision community. However, while a substantial amount of diffusion-based research has focused on generative tasks, few studies have applied diffusion models to general medical image classification. In this paper, we propose the first diffusion-based model (named DiffMIC) to address general medical image classification by eliminating unexpected noise and perturbations in medical images and robustly capturing semantic representation. To achieve this goal, we devise a dual conditional guidance strategy that conditions each diffusion step with multiple granularities to improve step-wise regional attention. Furthermore, we propose learning the mutual information in each granularity by enforcing Maximum-Mean Discrepancy regularization during the diffusion forward process. We evaluate the effectiveness of our DiffMIC on three medical classification tasks with different image modalities, including placental maturity grading on ultrasound images, skin lesion classification using dermatoscopic images, and diabetic retinopathy grading using fundus images. Our experimental results demonstrate that DiffMIC outperforms state-of-the-art methods by a significant margin, indicating the universality and effectiveness of the proposed model. Our code will be publicly available at https://github.com/scott-yjyang/DiffMIC.
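For readers unfamiliar with Maximum-Mean Discrepancy, the sketch below computes the standard biased estimate of squared MMD between two sample sets. The Gaussian kernel and the `gamma` value are assumptions, and the code does not reproduce how DiffMIC applies the regularizer within the diffusion process.

```python
import math


def rbf(x, y, gamma=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def mean_kernel(xs, ys, gamma):
    """Average kernel value over all pairs (x, y)."""
    return sum(rbf(x, y, gamma) for x in xs for y in ys) / (len(xs) * len(ys))


def mmd2(xs, ys, gamma=1.0):
    """Biased estimate of squared MMD between samples xs and ys."""
    return (
        mean_kernel(xs, xs, gamma)
        + mean_kernel(ys, ys, gamma)
        - 2 * mean_kernel(xs, ys, gamma)
    )


# Toy samples: identical sets give zero, well-separated sets give a large value.
near = [[0.0, 0.0], [1.0, 1.0]]
far = [[5.0, 5.0], [6.0, 6.0]]
same_gap = mmd2(near, near)
diff_gap = mmd2(near, far)
```

Used as a regularizer, a term of this kind penalizes differences between two feature distributions without requiring paired samples.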
Masked image modeling (MIM) with transformer backbones has recently been exploited as a powerful self-supervised pre-training technique. Existing MIM methods mask random patches of the image and reconstruct the missing pixels, which only considers semantic information at a lower level and causes a long pre-training time. This paper presents HybridMIM, a novel hybrid self-supervised learning method based on masked image modeling for 3D medical image segmentation. Specifically, we design a two-level masking hierarchy to specify which patches in sub-volumes are masked and how, effectively providing the constraints of higher-level semantic information. Then we learn the semantic information of medical images at three levels: 1) partial region prediction to reconstruct key contents of the 3D image, which largely reduces the pre-training time burden (pixel-level); 2) patch-masking perception to learn the spatial relationship between the patches in each sub-volume (region-level); and 3) drop-out-based contrastive learning between samples within a mini-batch, which further improves the generalization ability of the framework (sample-level). The proposed framework is versatile, supporting both CNN and transformer encoder backbones, and also enables pre-training of decoders for image segmentation. We conduct comprehensive experiments on four widely used public medical image segmentation datasets: BraTS2020, BTCV, MSD Liver, and MSD Spleen. The experimental results show the clear superiority of HybridMIM over competing supervised methods, masked pre-training approaches, and other self-supervised methods, in terms of quantitative metrics, timing performance, and qualitative observations. The code of HybridMIM is available at https://github.com/ge-xing/HybridMIM
In recent years, Denoising Diffusion Models have demonstrated remarkable success in generating semantically valuable pixel-wise representations for image generative modeling. In this study, we propose a novel end-to-end framework, called Diff-UNet, for medical volumetric segmentation. Our approach integrates the diffusion model into a standard U-shaped architecture to extract semantic information from the input volume effectively, resulting in excellent pixel-level representations for medical volumetric segmentation. To enhance the robustness of the diffusion model's prediction results, we also introduce a Step-Uncertainty based Fusion (SUF) module during inference to combine the outputs of the diffusion models at each step. We evaluate our method on three datasets, including multimodal brain tumors in MRI, liver tumors, and multi-organ CT volumes, and demonstrate that Diff-UNet outperforms other state-of-the-art methods significantly. Our experimental results also indicate the universality and effectiveness of the proposed model. The proposed framework has the potential to facilitate the accurate diagnosis and treatment of medical conditions by enabling more precise segmentation of anatomical structures. The codes of Diff-UNet are available at https://github.com/ge-xing/Diff-UNet
Video dehazing aims to recover haze-free frames with high visibility and contrast. This paper presents a novel framework to effectively explore the physical haze priors and aggregate temporal information. Specifically, we design a memory-based physical prior guidance module to encode the prior-related features into long-range memory. Besides, we formulate a multi-range scene radiance recovery module to capture space-time dependencies in multiple space-time ranges, which helps to effectively aggregate temporal information from adjacent frames. Moreover, we construct the first large-scale outdoor video dehazing benchmark dataset, which contains videos in various real-world scenarios. Experimental results on both synthetic and real conditions show the superiority of our proposed method.