Zhanghan Ke

Towards Self-Adaptive Pseudo-Label Filtering for Semi-Supervised Learning

Sep 18, 2023
Lei Zhu, Zhanghan Ke, Rynson Lau

Recent semi-supervised learning (SSL) methods typically include a filtering strategy to improve the quality of pseudo labels. However, these filtering strategies are usually hand-crafted and do not change as the model is updated, so many correct pseudo labels are discarded and many incorrect ones are selected during training. In this work, we observe that the distribution gap between the confidence values of correct and incorrect pseudo labels emerges at the very beginning of training and can be utilized to filter pseudo labels. Based on this observation, we propose a Self-Adaptive Pseudo-Label Filter (SPF), which automatically filters noise in pseudo labels as the model evolves by modeling the confidence distribution throughout the training process. Specifically, with an online mixture model, we weight each pseudo-labeled sample by the posterior probability of it being correct, which takes the confidence distribution at that moment into account. Unlike previous hand-crafted filters, our SPF evolves together with the deep neural network without manual tuning. Extensive experiments demonstrate that incorporating SPF into existing SSL methods improves their performance, especially when labeled data is extremely scarce.
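
To make the idea concrete, here is a minimal sketch (not the paper's implementation) of weighting pseudo-labeled samples by the posterior of being correct under a two-component mixture fitted on confidence values; the function name spf_weights and the per-batch GaussianMixture fit are illustrative assumptions rather than the paper's online formulation.

```python
import numpy as np
import torch
import torch.nn.functional as F
from sklearn.mixture import GaussianMixture

def spf_weights(logits_unlabeled: torch.Tensor) -> torch.Tensor:
    """Weight each pseudo-labeled sample by the posterior of it being correct."""
    probs = F.softmax(logits_unlabeled.detach(), dim=1)
    conf = probs.max(dim=1).values.cpu().numpy().reshape(-1, 1)

    # Fit a two-component mixture on the confidences of the current batch.
    gmm = GaussianMixture(n_components=2, random_state=0).fit(conf)
    correct_comp = int(np.argmax(gmm.means_))   # higher-mean component ~ correct pseudo labels
    posterior = gmm.predict_proba(conf)[:, correct_comp]
    return torch.from_numpy(posterior).float().to(logits_unlabeled.device)

# Usage in a FixMatch-style unlabeled loss (illustrative):
#   weights = spf_weights(weak_logits)
#   pseudo = weak_logits.argmax(dim=1)
#   loss_u = (weights * F.cross_entropy(strong_logits, pseudo, reduction="none")).mean()
```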

* This paper was first submitted to NeurIPS 2021 

Neural Preset for Color Style Transfer

Mar 24, 2023
Zhanghan Ke, Yuhao Liu, Lei Zhu, Nanxuan Zhao, Rynson W. H. Lau

In this paper, we present a Neural Preset technique to address the limitations of existing color style transfer methods, including visual artifacts, large memory requirements, and slow style switching. Our method is based on two core designs. First, we propose Deterministic Neural Color Mapping (DNCM) to operate consistently on each pixel via an image-adaptive color mapping matrix, avoiding artifacts and supporting high-resolution inputs with a small memory footprint. Second, we develop a two-stage pipeline that divides the task into color normalization and stylization, which allows efficient style switching by extracting color styles as presets and reusing them on normalized input images. Since pairwise datasets are unavailable, we describe how to train Neural Preset via a self-supervised strategy. Comprehensive evaluations demonstrate various advantages of Neural Preset over existing methods. Notably, Neural Preset enables stable 4K color style transfer in real time without artifacts. Besides, we show that our trained model can naturally support multiple applications without fine-tuning, including low-light image enhancement, underwater image correction, image dehazing, and image harmonization. Project page with demos: https://zhkkke.github.io/NeuralPreset .
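
As a rough illustration of a deterministic, image-adaptive per-pixel color mapping in the spirit of DNCM (not the paper's exact design), the sketch below predicts a 3x3 color matrix plus bias from a thumbnail of the image and applies them to every full-resolution pixel; the encoder layout and the 3x3+bias parameterization are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ColorMappingSketch(nn.Module):
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, hidden, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(hidden, 12)   # 3x3 matrix + 3-dim bias

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        # Predict the mapping from a small thumbnail; apply it at full resolution.
        thumb = F.interpolate(img, size=(256, 256), mode="bilinear", align_corners=False)
        params = self.head(self.encoder(thumb).flatten(1))   # (B, 12)
        mat = params[:, :9].view(-1, 3, 3)
        bias = params[:, 9:].view(-1, 3, 1)
        b, c, h, w = img.shape
        pixels = img.view(b, c, h * w)                        # (B, 3, HW)
        out = torch.bmm(mat, pixels) + bias                   # same linear map on every pixel
        return out.view(b, c, h, w).clamp(0, 1)

# img = torch.rand(1, 3, 2160, 3840)   # only a thumbnail is encoded, so memory stays small
# styled = ColorMappingSketch()(img)
```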

* Project page with demos: https://zhkkke.github.io/NeuralPreset . Artifact-free real-time 4K color style transfer via AI-generated presets. CVPR 2023 

BiFormer: Vision Transformer with Bi-Level Routing Attention

Mar 15, 2023
Lei Zhu, Xinjiang Wang, Zhanghan Ke, Wayne Zhang, Rynson Lau

As the core building block of vision transformers, attention is a powerful tool to capture long-range dependency. However, such power comes at a cost: it incurs a huge computation burden and heavy memory footprint as pairwise token interaction across all spatial locations is computed. A series of works attempt to alleviate this problem by introducing handcrafted and content-agnostic sparsity into attention, such as restricting the attention operation to be inside local windows, axial stripes, or dilated windows. In contrast to these approaches, we propose a novel dynamic sparse attention via bi-level routing to enable a more flexible allocation of computations with content awareness. Specifically, for a query, irrelevant key-value pairs are first filtered out at a coarse region level, and then fine-grained token-to-token attention is applied in the union of remaining candidate regions (i.e., routed regions). We provide a simple yet effective implementation of the proposed bi-level routing attention, which utilizes the sparsity to save both computation and memory while involving only GPU-friendly dense matrix multiplications. Built with the proposed bi-level routing attention, a new general vision transformer, named BiFormer, is then presented. As BiFormer attends to a small subset of relevant tokens in a query-adaptive manner without distraction from other irrelevant ones, it enjoys both good performance and high computational efficiency, especially in dense prediction tasks. Empirical results across several computer vision tasks such as image classification, object detection, and semantic segmentation verify the effectiveness of our design. Code is available at https://github.com/rayleizhu/BiFormer.
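
The simplified, single-head sketch below illustrates the routing scheme described above: a region-level affinity selects the top-k key regions for each query region, and token-to-token attention then runs only over the gathered tokens of those routed regions. It is an illustration of the idea, not the official implementation linked above.

```python
import torch
import torch.nn.functional as F

def bilevel_routing_attention(q, k, v, num_regions: int, topk: int):
    """q, k, v: (B, N, C) with N divisible by num_regions."""
    B, N, C = q.shape
    T = N // num_regions   # tokens per region

    # Coarse level: region descriptors and region-to-region routing.
    qr = q.view(B, num_regions, T, C).mean(dim=2)            # (B, R, C)
    kr = k.view(B, num_regions, T, C).mean(dim=2)            # (B, R, C)
    affinity = qr @ kr.transpose(1, 2)                       # (B, R, R)
    routed = affinity.topk(topk, dim=-1).indices             # (B, R, topk)

    # Fine level: gather key/value tokens of the routed regions.
    k_regions = k.view(B, num_regions, T, C)
    v_regions = v.view(B, num_regions, T, C)
    idx = routed[..., None, None].expand(-1, -1, -1, T, C)
    gk = torch.gather(k_regions.unsqueeze(1).expand(-1, num_regions, -1, -1, -1), 2, idx)
    gv = torch.gather(v_regions.unsqueeze(1).expand(-1, num_regions, -1, -1, -1), 2, idx)
    gk = gk.reshape(B, num_regions, topk * T, C)
    gv = gv.reshape(B, num_regions, topk * T, C)

    # Token-to-token attention restricted to the routed tokens.
    qt = q.view(B, num_regions, T, C)
    attn = F.softmax(qt @ gk.transpose(-1, -2) / C ** 0.5, dim=-1)
    out = attn @ gv                                          # (B, R, T, C)
    return out.reshape(B, N, C)

# x = torch.rand(2, 64 * 16, 96)   # e.g. 64 regions of 16 tokens each
# y = bilevel_routing_attention(x, x, x, num_regions=64, topk=4)
```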

* CVPR 2023 camera-ready 

Structure-Informed Shadow Removal Networks

Jan 09, 2023
Yuhao Liu, Qing Guo, Lan Fu, Zhanghan Ke, Ke Xu, Wei Feng, Ivor W. Tsang, Rynson W. H. Lau

Shadow removal is a fundamental task in computer vision. Despite their success, existing deep learning-based shadow removal methods still produce images with shadow remnants. These shadow remnants typically exist in homogeneous regions with low intensity values, making them untraceable in the existing image-to-image mapping paradigm. We observe from our experiments that shadows mainly degrade object colors at the image structure level (at which humans perceive object outlines filled with continuous colors). Hence, in this paper, we propose to remove shadows at the image structure level. Based on this idea, we propose a novel structure-informed shadow removal network (StructNet) that leverages image structure information to address the shadow remnant problem. Specifically, StructNet first reconstructs the structure information of the input image without shadows and then uses the restored shadow-free structure prior to guide image-level shadow removal. StructNet contains two main novel modules: (1) a mask-guided shadow-free extraction (MSFE) module to extract image structural features in a non-shadow-to-shadow directional manner, and (2) a multi-scale feature & residual aggregation (MFRA) module to leverage the shadow-free structure information to regularize feature consistency. In addition, we propose to extend StructNet to exploit multi-level structure information (MStructNet), which further boosts shadow removal performance with minimal computational overhead. Extensive experiments on three shadow removal benchmarks demonstrate that our method outperforms existing shadow removal methods, and that StructNet can be integrated with existing methods to further boost their performance.
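
A heavily simplified sketch of the two-stage idea (restore a shadow-free structure, then condition image-level removal on it) is given below; the blur-based structure proxy and the tiny sub-networks are illustrative assumptions and do not reflect the MSFE and MFRA modules themselves.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def structure_of(img: torch.Tensor) -> torch.Tensor:
    # Crude structure proxy: heavy smoothing keeps outlines filled with continuous colors.
    return F.avg_pool2d(img, kernel_size=9, stride=1, padding=4)

def conv_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True),
                         nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU(inplace=True))

class TwoStageShadowRemovalSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.structure_net = conv_block(4, 32)    # shadow-image structure + shadow mask
        self.structure_head = nn.Conv2d(32, 3, 1)
        self.removal_net = conv_block(7, 32)      # image + mask + restored structure
        self.removal_head = nn.Conv2d(32, 3, 1)

    def forward(self, img, mask):
        # Stage 1: restore the shadow-free structure.
        s_in = torch.cat([structure_of(img), mask], dim=1)
        s_free = self.structure_head(self.structure_net(s_in))
        # Stage 2: structure-guided image-level shadow removal.
        r_in = torch.cat([img, mask, s_free], dim=1)
        return self.removal_head(self.removal_net(r_in)), s_free

# out, restored = TwoStageShadowRemovalSketch()(torch.rand(1, 3, 256, 256), torch.rand(1, 1, 256, 256))
```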

* Tech report 

Harmonizer: Learning to Perform White-Box Image and Video Harmonization

Jul 04, 2022
Zhanghan Ke, Chunyi Sun, Lei Zhu, Ke Xu, Rynson W. H. Lau

Recent works on image harmonization solve the problem as a pixel-wise image translation task via large autoencoders. They perform unsatisfactorily and have slow inference speeds when dealing with high-resolution images. In this work, we observe that adjusting the input arguments of basic image filters, e.g., brightness and contrast, is sufficient for humans to produce realistic images from composite ones. Hence, we frame image harmonization as an image-level regression problem that learns the arguments of the filters humans would use for the task. We present a Harmonizer framework for image harmonization. Unlike prior methods that are based on black-box autoencoders, Harmonizer contains a neural network for filter argument prediction and several white-box filters (based on the predicted arguments) for image harmonization. We also introduce a cascade regressor and a dynamic loss strategy for Harmonizer to learn filter arguments more stably and precisely. Since our network only outputs image-level arguments and the filters we use are efficient, Harmonizer is much lighter and faster than existing methods. Comprehensive experiments demonstrate that Harmonizer surpasses existing methods notably, especially with high-resolution inputs. Finally, we apply Harmonizer to video harmonization, achieving consistent results across frames at 56 fps at 1080P resolution. Code and models are available at: https://github.com/ZHKKKe/Harmonizer.
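
The sketch below illustrates the white-box formulation: a small encoder regresses image-level filter arguments (brightness and contrast here) from a low-resolution copy, and differentiable filters then apply them at full resolution. The backbone and the two-filter set are assumptions for illustration; the released code at the link above is the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def apply_brightness(img, arg):            # arg in [-1, 1], one value per image
    return (img + arg.view(-1, 1, 1, 1)).clamp(0, 1)

def apply_contrast(img, arg):              # arg in [-1, 1], one value per image
    mean = img.mean(dim=[2, 3], keepdim=True)
    return ((img - mean) * (1 + arg.view(-1, 1, 1, 1)) + mean).clamp(0, 1)

class HarmonizerSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, 2)        # one argument per filter

    def forward(self, composite, fg_mask):
        # Arguments are predicted at low resolution; filters run at full resolution.
        small = F.interpolate(torch.cat([composite, fg_mask], 1), size=(256, 256),
                              mode="bilinear", align_corners=False)
        args = torch.tanh(self.head(self.encoder(small)))
        out = apply_brightness(composite, args[:, 0])
        out = apply_contrast(out, args[:, 1])
        # Only the composited foreground is harmonized; the background is kept.
        return out * fg_mask + composite * (1 - fg_mask)

# harmonized = HarmonizerSketch()(torch.rand(1, 3, 1080, 1920), torch.ones(1, 1, 1080, 1920))
```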

MODNet-V: Improving Portrait Video Matting via Background Restoration

Sep 24, 2021
Jiayu Sun, Zhanghan Ke, Lihe Zhang, Huchuan Lu, Rynson W. H. Lau

To address the challenging portrait video matting problem more precisely, existing works typically rely on matting priors that require additional user effort to obtain, such as annotated trimaps or background images. In this work, we observe that instead of asking the user to explicitly provide a background image, we can recover it from the input video itself. To this end, we first propose a novel background restoration module (BRM) to recover the background image dynamically from the input video. BRM is extremely lightweight and can be easily integrated into existing matting models. By combining BRM with a recent image matting model, MODNet, we then present MODNet-V for portrait video matting. Benefiting from the strong background prior provided by BRM, MODNet-V has only 1/3 of the parameters of MODNet yet achieves comparable or even better performance. Our design allows MODNet-V to be trained end-to-end on a single NVIDIA 3090 GPU. Finally, we introduce a new patch refinement module (PRM) to adapt MODNet-V for high-resolution videos while keeping it lightweight and fast.
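
As a toy illustration of recovering a background prior from the video itself (not the BRM architecture), the sketch below keeps a running background estimate and updates it only where the current alpha matte indicates background; the momentum and threshold values are assumptions.

```python
import torch

@torch.no_grad()
def update_background(bg_est, frame, alpha, momentum=0.9, fg_thresh=0.1):
    """bg_est, frame: (B, 3, H, W); alpha: (B, 1, H, W) with 1 = foreground."""
    is_bg = (alpha < fg_thresh).float()
    blended = momentum * bg_est + (1 - momentum) * frame
    return is_bg * blended + (1 - is_bg) * bg_est   # keep the old estimate under the person

# Illustrative loop (first_frame, video, and the matting model are hypothetical names):
#   bg = first_frame.clone()
#   for frame, alpha in video:
#       bg = update_background(bg, frame, alpha)
```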

Is a Green Screen Really Necessary for Real-Time Portrait Matting?

Nov 29, 2020
Zhanghan Ke, Kaican Li, Yurou Zhou, Qiuhua Wu, Xiangyu Mao, Qiong Yan, Rynson W. H. Lau

For portrait matting without a green screen, existing works either require auxiliary inputs that are costly to obtain or use multiple models that are computationally expensive, making them impractical for real-time applications. In contrast, we present a lightweight matting objective decomposition network (MODNet) that performs portrait matting from a single input image in real time. The design of MODNet benefits from optimizing a series of correlated sub-objectives simultaneously via explicit constraints. Moreover, since trimap-free methods usually suffer from the domain shift problem in practice, we introduce (1) a self-supervised strategy based on sub-objective consistency to adapt MODNet to real-world data and (2) a one-frame delay trick to smooth the results when applying MODNet to portrait video sequences. MODNet is easy to train in an end-to-end manner. It is much faster than contemporaneous matting methods and runs at 63 frames per second. On a carefully designed portrait matting benchmark newly proposed in this work, MODNet greatly outperforms prior trimap-free methods. More importantly, our method achieves remarkable results on daily photos and videos. Now, do you really need a green screen for real-time portrait matting?
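
The sketch below shows one plausible form of the one-frame delay trick: a pixel in frame t is treated as flicker when frames t-1 and t+1 agree with each other but disagree with frame t, and is replaced by their average. The threshold and the exact flicker test are illustrative assumptions.

```python
import torch

def one_frame_delay(alpha_prev, alpha_curr, alpha_next, tol=0.1):
    """All inputs: (B, 1, H, W) alpha mattes in [0, 1]."""
    neighbors_agree = (alpha_prev - alpha_next).abs() < tol
    curr_deviates = ((alpha_curr - alpha_prev).abs() > tol) & \
                    ((alpha_curr - alpha_next).abs() > tol)
    flicker = neighbors_agree & curr_deviates
    # Output lags by one frame, since frame t+1 must be available before smoothing frame t.
    return torch.where(flicker, (alpha_prev + alpha_next) / 2, alpha_curr)
```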

Guided Collaborative Training for Pixel-wise Semi-Supervised Learning

Aug 12, 2020
Zhanghan Ke, Di Qiu, Kaican Li, Qiong Yan, Rynson W. H. Lau

We investigate the generalization of semi-supervised learning (SSL) to diverse pixel-wise tasks. Although SSL methods have achieved impressive results in image classification, their performance on pixel-wise tasks is unsatisfactory due to the need for dense outputs. In addition, existing pixel-wise SSL approaches are only suitable for certain tasks, as they usually rely on task-specific properties. In this paper, we present a new SSL framework, named Guided Collaborative Training (GCT), for pixel-wise tasks, with two main technical contributions. First, GCT addresses the issues caused by dense outputs through a novel flaw detector. Second, the modules in GCT learn from unlabeled data collaboratively through two newly proposed constraints that are independent of task-specific properties. As a result, GCT can be applied to a wide range of pixel-wise tasks without structural adaptation. Our extensive experiments on four challenging vision tasks, including semantic segmentation, real image denoising, portrait image matting, and night image enhancement, show that GCT outperforms state-of-the-art SSL methods by a large margin. Our code is available at: https://github.com/ZHKKKe/PixelSSL.
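
A minimal sketch of the flaw-detector idea is shown below: on labeled data the detector is supervised to predict where predictions deviate from ground truth, and on unlabeled data each task model learns from the other only at locations the detector marks as reliable. The loss forms and threshold are illustrative assumptions; the released code at the link above is the reference.

```python
import torch

def flaw_target(pred, gt):
    # Supervision for the flaw detector on labeled data: where the prediction deviates from gt.
    return (pred - gt).abs().mean(dim=1, keepdim=True).detach()

def dynamic_consistency_loss(pred_1, pred_2, flaw_1, flaw_2, thresh=0.1):
    """flaw_i: flaw-detector output for model i's prediction, shape (B, 1, H, W)."""
    # Where the other model looks less flawed (and reliable enough), learn from it.
    learn_from_2 = ((flaw_2 < flaw_1) & (flaw_2 < thresh)).float()
    learn_from_1 = ((flaw_1 < flaw_2) & (flaw_1 < thresh)).float()
    loss_1 = (learn_from_2 * (pred_1 - pred_2.detach()) ** 2).mean()
    loss_2 = (learn_from_1 * (pred_2 - pred_1.detach()) ** 2).mean()
    return loss_1 + loss_2
```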

* 16th European Conference on Computer Vision (ECCV 2020) 

Towards Geometry Guided Neural Relighting with Flash Photography

Aug 12, 2020
Di Qiu, Jin Zeng, Zhanghan Ke, Wenxiu Sun, Chengxi Yang

Previous image-based relighting methods require capturing multiple images to acquire high-frequency lighting effects under different lighting conditions, which requires nontrivial effort and may be unrealistic in certain practical use scenarios. While such approaches rely entirely on cleverly sampling the color images under different lighting conditions, little has been done to utilize the geometric information that crucially influences high-frequency features in the images, such as glossy highlights and cast shadows. We therefore propose a deep learning framework for image relighting from a single flash photograph with its corresponding depth map. By incorporating the depth map, our approach is able to extrapolate realistic high-frequency effects under novel lighting via geometry-guided image decomposition from the flash image, and to predict the cast shadow map from the shadow-encoding transformed depth map. Moreover, the single-image setup greatly simplifies the data capture process. We experimentally validate the advantage of our geometry-guided approach over state-of-the-art image-based approaches in intrinsic image decomposition and image relighting, and also demonstrate our performance on real mobile phone photos.
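
The sketch below is a heavily simplified, assumed layout of the geometry-guided setup: a network consumes the flash image, its depth map, and a target light direction, and predicts a relit image together with a cast-shadow map. The single-encoder/two-head design and the way the light direction is injected are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class FlashRelightSketch(nn.Module):
    def __init__(self, hidden=48):
        super().__init__()
        # Input channels: flash RGB (3) + depth (1) + target light direction broadcast to a map (3).
        self.encoder = nn.Sequential(
            nn.Conv2d(7, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.relit_head = nn.Conv2d(hidden, 3, 1)    # relit image under the novel light
        self.shadow_head = nn.Conv2d(hidden, 1, 1)   # cast-shadow map under the novel light

    def forward(self, flash_img, depth, light_dir):
        b, _, h, w = flash_img.shape
        light_map = light_dir.view(b, 3, 1, 1).expand(b, 3, h, w)
        feat = self.encoder(torch.cat([flash_img, depth, light_map], dim=1))
        return torch.sigmoid(self.relit_head(feat)), torch.sigmoid(self.shadow_head(feat))

# relit, shadow = FlashRelightSketch()(torch.rand(1, 3, 256, 256),
#                                      torch.rand(1, 1, 256, 256),
#                                      torch.tensor([[0., 0., 1.]]))
```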

Dual Student: Breaking the Limits of the Teacher in Semi-supervised Learning

Sep 03, 2019
Zhanghan Ke, Daoye Wang, Qiong Yan, Jimmy Ren, Rynson W. H. Lau

Recently, consistency-based methods have achieved state-of-the-art results in semi-supervised learning (SSL). These methods always involve two roles, an explicit or implicit teacher model and a student model, and penalize inconsistent predictions under different perturbations via a consistency constraint. However, the weights of these two roles are tightly coupled, since the teacher is essentially an exponential moving average (EMA) of the student. In this work, we show that the coupled EMA teacher causes a performance bottleneck. To address this problem, we introduce Dual Student, which replaces the teacher with another student. We also define a novel concept, the stable sample, based on which a stabilization constraint is designed to make our structure trainable. Further, we discuss two variants of our method, which yield even higher performance. Extensive experiments show that our method significantly improves classification performance on several mainstream SSL benchmarks. Specifically, it reduces the error rate of the 13-layer CNN from 16.84% to 12.39% on CIFAR-10 with 1k labels and from 34.10% to 31.56% on CIFAR-100 with 10k labels. In addition, our method also achieves clear improvement in domain adaptation.
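
To contrast the coupled EMA teacher with the dual-student setup, the sketch below shows an EMA weight update next to an assumed form of the stability test and stabilization constraint between two independently initialized students; the confidence threshold and the alignment details are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, decay=0.999):
    # The coupled baseline: teacher weights follow the student.
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(decay).add_(s, alpha=1 - decay)

def stable_mask(logits, logits_perturbed, conf_thresh=0.8):
    # A sample is "stable" if the prediction survives perturbation with enough confidence.
    p, p2 = F.softmax(logits, 1), F.softmax(logits_perturbed, 1)
    same_class = p.argmax(1) == p2.argmax(1)
    confident = torch.maximum(p.max(1).values, p2.max(1).values) > conf_thresh
    return same_class & confident

def stabilization_loss(logits_a, logits_a2, logits_b, logits_b2):
    stable_a = stable_mask(logits_a, logits_a2)
    stable_b = stable_mask(logits_b, logits_b2)
    # Where only one student is stable, pull the other towards it.
    a_learns = (stable_b & ~stable_a).float().unsqueeze(1)
    b_learns = (stable_a & ~stable_b).float().unsqueeze(1)
    loss_a = (a_learns * (F.softmax(logits_a, 1) - F.softmax(logits_b, 1).detach()) ** 2).mean()
    loss_b = (b_learns * (F.softmax(logits_b, 1) - F.softmax(logits_a, 1).detach()) ** 2).mean()
    return loss_a + loss_b
```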

* International Conference on Computer Vision 2019 (ICCV 2019) 