The sequential recommendation task aims to predict the item that user is interested in according to his/her historical action sequence. However, inevitable random action, i.e. user randomly accesses an item among multiple candidates or clicks several items at random order, cause the sequence fails to provide stable and high-quality signals. To alleviate the issue, we propose the StatisTics-Driven Pre-traing framework (called STDP briefly). The main idea of the work lies in the exploration of utilizing the statistics information along with the pre-training paradigm to stabilize the optimization of recommendation model. Specifically, we derive two types of statistical information: item co-occurrence across sequence and attribute frequency within the sequence. And we design the following pre-training tasks: 1) The co-occurred items prediction task, which encourages the model to distribute its attention on multiple suitable targets instead of just focusing on the next item that may be unstable. 2) We generate a paired sequence by replacing items with their co-occurred items and enforce its representation close with the original one, thus enhancing the model's robustness to the random noise. 3) To reduce the impact of random on user's long-term preferences, we encourage the model to capture sequence-level frequent attributes. The significant improvement over six datasets demonstrates the effectiveness and superiority of the proposal, and further analysis verified the generalization of the STDP framework on other models.
Instruction tuning has proven essential for enhancing the performance of large language models (LLMs) in generating human-aligned responses. However, collecting diverse, high-quality instruction data for tuning poses challenges, particularly in privacy-sensitive domains. Federated instruction tuning (FedIT) has emerged as a solution, leveraging federated learning from multiple data owners while preserving privacy. Yet, it faces challenges due to limited instruction data and vulnerabilities to training data extraction attacks. To address these issues, we propose a novel federated algorithm, FedPIT, which utilizes LLMs' in-context learning capability to self-generate task-specific synthetic data for training autonomously. Our method employs parameter-isolated training to maintain global parameters trained on synthetic data and local parameters trained on augmented local data, effectively thwarting data extraction attacks. Extensive experiments on real-world medical data demonstrate the effectiveness of FedPIT in improving federated few-shot performance while preserving privacy and robustness against data heterogeneity.
Large Language Models (LLMs) exhibit remarkable generative capabilities, enabling the generation of valuable information. Despite these advancements, previous research found that LLMs sometimes struggle with adhering to specific constraints (e.g., in specific place or at specific time), at times even overlooking them, which leads to responses that are either too generic or not fully satisfactory. Existing approaches attempted to address this issue by decomposing or rewriting input instructions, yet they fall short in adequately emphasizing specific constraints and in unlocking the underlying knowledge (e.g., programming within the context of software development). In response, this paper proposes a simple yet effective method named Chain-of-Specificity (CoS). Specifically, CoS iteratively emphasizes the specific constraints in the input instructions, unlocks knowledge within LLMs, and refines responses. Experiments conducted on publicly available and self-build complex datasets demonstrate that CoS outperforms existing methods in enhancing generated content especially for the specificity. Besides, as the number of specific constraints increase, other baselines falter, while CoS still performs well. Moreover, we show that distilling responses generated by CoS effectively enhances the ability of smaller models to follow the constrained instructions. Resources of this paper will be released for further research.
Burst super-resolution (BurstSR) aims at reconstructing a high-resolution (HR) image from a sequence of low-resolution (LR) and noisy images, which is conducive to enhancing the imaging effects of smartphones with limited sensors. The main challenge of BurstSR is to effectively combine the complementary information from input frames, while existing methods still struggle with it. In this paper, we suggest fusing cues frame-by-frame with an efficient and flexible recurrent network. In particular, we emphasize the role of the base-frame and utilize it as a key prompt to guide the knowledge acquisition from other frames in every recurrence. Moreover, we introduce an implicit weighting loss to improve the model's flexibility in facing input frames with variable numbers. Extensive experiments on both synthetic and real-world datasets demonstrate that our method achieves better results than state-of-the-art ones. Codes and pre-trained models are available at https://github.com/ZcsrenlongZ/RBSR.
Natural videos captured by consumer cameras often suffer from low framerate and motion blur due to the combination of dynamic scene complexity, lens and sensor imperfection, and less than ideal exposure setting. As a result, computational methods that jointly perform video frame interpolation and deblurring begin to emerge with the unrealistic assumption that the exposure time is known and fixed. In this work, we aim ambitiously for a more realistic and challenging task - joint video multi-frame interpolation and deblurring under unknown exposure time. Toward this goal, we first adopt a variant of supervised contrastive learning to construct an exposure-aware representation from input blurred frames. We then train two U-Nets for intra-motion and inter-motion analysis, respectively, adapting to the learned exposure representation via gain tuning. We finally build our video reconstruction network upon the exposure and motion representation by progressive exposure-adaptive convolution and motion refinement. Extensive experiments on both simulated and real-world datasets show that our optimized method achieves notable performance gains over the state-of-the-art on the joint video x8 interpolation and deblurring task. Moreover, on the seemingly implausible x16 interpolation task, our method outperforms existing methods by more than 1.5 dB in terms of PSNR.
Recently, deep learning-based image denoising methods have achieved promising performance on test data with the same distribution as training set, where various denoising models based on synthetic or collected real-world training data have been learned. However, when handling real-world noisy images, the denoising performance is still limited. In this paper, we propose a simple yet effective Bayesian deep ensemble (BDE) method for real-world image denoising, where several representative deep denoisers pre-trained with various training data settings can be fused to improve robustness. The foundation of BDE is that real-world image noises are highly signal-dependent, and heterogeneous noises in a real-world noisy image can be separately handled by different denoisers. In particular, we take well-trained CBDNet, NBNet, HINet, Uformer and GMSNet into denoiser pool, and a U-Net is adopted to predict pixel-wise weighting maps to fuse these denoisers. Instead of solely learning pixel-wise weighting maps, Bayesian deep learning strategy is introduced to predict weighting uncertainty as well as weighting map, by which prediction variance can be modeled for improving robustness on real-world noisy images. Extensive experiments have shown that real-world noises can be better removed by fusing existing denoisers instead of training a big denoiser with expensive cost. On DND dataset, our BDE achieves +0.28~dB PSNR gain over the state-of-the-art denoising method. Moreover, we note that our BDE denoiser based on different Gaussian noise levels outperforms state-of-the-art CBDNet when applying to real-world noisy images. Furthermore, our BDE can be extended to other image restoration tasks, and achieves +0.30dB, +0.18dB and +0.12dB PSNR gains on benchmark datasets for image deblurring, image deraining and single image super-resolution, respectively.
The study of multi-task learning has drawn great attention from the community. Despite the remarkable progress, the challenge of optimally learning different tasks simultaneously remains to be explored. Previous works attempt to modify the gradients from different tasks. Yet these methods give a subjective assumption of the relationship between tasks, and the modified gradient may be less accurate. In this paper, we introduce Stochastic Task Allocation~(STA), a mechanism that addresses this issue by a task allocation approach, in which each sample is randomly allocated a subset of tasks. For further progress, we propose Interleaved Stochastic Task Allocation~(ISTA) to iteratively allocate all tasks to each example during several consecutive iterations. We evaluate STA and ISTA on various datasets and applications: NYUv2, Cityscapes, and COCO for scene understanding and instance segmentation. Our experiments show both STA and ISTA outperform current state-of-the-art methods. The code will be available.
In this paper, we consider two challenging issues in reference-based super-resolution (RefSR), (i) how to choose a proper reference image, and (ii) how to learn real-world RefSR in a self-supervised manner. Particularly, we present a novel self-supervised learning approach for real-world image SR from observations at dual camera zooms (SelfDZSR). For the first issue, the more zoomed (telephoto) image can be naturally leveraged as the reference to guide the SR of the lesser zoomed (short-focus) image. For the second issue, SelfDZSR learns a deep network to obtain the SR result of short-focal image and with the same resolution as the telephoto image. For this purpose, we take the telephoto image instead of an additional high-resolution image as the supervision information and select a patch from it as the reference to super-resolve the corresponding short-focus image patch. To mitigate the effect of various misalignment between the short-focus low-resolution (LR) image and telephoto ground-truth (GT) image, we design a degradation model and map the GT to a pseudo-LR image aligned with GT. Then the pseudo-LR and LR image can be fed into the proposed adaptive spatial transformer networks (AdaSTN) to deform the LR features. During testing, SelfDZSR can be directly deployed to super-solve the whole short-focus image with the reference of telephoto image. Experiments show that our method achieves better quantitative and qualitative performance against state-of-the-arts. The code and pre-trained models will be publicly available.
Crowd counting is critical for numerous video surveillance scenarios. One of the main issues in this task is how to handle the dramatic scale variations of pedestrians caused by the perspective effect. To address this issue, this paper proposes a novel convolution neural network-based crowd counting method, termed Perspective-guided Fractional-Dilation Network (PFDNet). By modeling the continuous scale variations, the proposed PFDNet is able to select the proper fractional dilation kernels for adapting to different spatial locations. It significantly improves the flexibility of the state-of-the-arts that only consider the discrete representative scales. In addition, by avoiding the multi-scale or multi-column architecture that used in other methods, it is computationally more efficient. In practice, the proposed PFDNet is constructed by stacking multiple Perspective-guided Fractional-Dilation Convolutions (PFC) on a VGG16-BN backbone. By introducing a novel generalized dilation convolution operation, the PFC can handle fractional dilation ratios in the spatial domain under the guidance of perspective annotations, achieving continuous scales modeling of pedestrians. To deal with the problem of unavailable perspective information in some cases, we further introduce an effective perspective estimation branch to the proposed PFDNet, which can be trained in either supervised or weakly-supervised setting once the branch has been pre-trained. Extensive experiments show that the proposed PFDNet outperforms state-of-the-art methods on ShanghaiTech A, ShanghaiTech B, WorldExpo'10, UCF-QNRF, UCF_CC_50 and TRANCOS dataset, achieving MAE 53.8, 6.5, 6.8, 84.3, 205.8, and 3.06 respectively.
For flexible non-blind image denoising, existing deep networks usually take both noisy image and noise level map as the input to handle various noise levels with a single model. However, in this kind of solution, the noise variance (i.e., noise level) is only deployed to modulate the first layer of convolution feature with channel-wise shifting, which is limited in balancing noise removal and detail preservation. In this paper, we present a novel flexible image enoising network (CFMNet) by equipping an U-Net backbone with multi-layer conditional feature modulation (CFM) modules. In comparison to channel-wise shifting only in the first layer, CFMNet can make better use of noise level information by deploying multiple layers of CFM. Moreover, each CFM module takes onvolutional features from both noisy image and noise level map as input for better trade-off between noise removal and detail preservation. Experimental results show that our CFMNet is effective in exploiting noise level information for flexible non-blind denoising, and performs favorably against the existing deep image denoising methods in terms of both quantitative metrics and visual quality.