Alert button
Picture for Junping Zhang

Junping Zhang

Alert button

Point, Segment and Count: A Generalized Framework for Object Counting

Nov 27, 2023
Zhizhong Huang, Mingliang Dai, Yi Zhang, Junping Zhang, Hongming Shan

Class-agnostic object counting aims to count all objects in an image with respect to example boxes or class names, \emph{a.k.a} few-shot and zero-shot counting. Current state-of-the-art methods highly rely on density maps to predict object counts, which lacks model interpretability. In this paper, we propose a generalized framework for both few-shot and zero-shot object counting based on detection. Our framework combines the superior advantages of two foundation models without compromising their zero-shot capability: (\textbf{i}) SAM to segment all possible objects as mask proposals, and (\textbf{ii}) CLIP to classify proposals to obtain accurate object counts. However, this strategy meets the obstacles of efficiency overhead and the small crowded objects that cannot be localized and distinguished. To address these issues, our framework, termed PseCo, follows three steps: point, segment, and count. Specifically, we first propose a class-agnostic object localization to provide accurate but least point prompts for SAM, which consequently not only reduces computation costs but also avoids missing small objects. Furthermore, we propose a generalized object classification that leverages CLIP image/text embeddings as the classifier, following a hierarchical knowledge distillation to obtain discriminative classifications among hierarchical mask proposals. Extensive experimental results on FSC-147 dataset demonstrate that PseCo achieves state-of-the-art performance in both few-shot/zero-shot object counting/detection, with additional results on large-scale COCO and LVIS datasets. The source code is available at \url{https://github.com/Hzzone/PseCo}.

* Fix typos 
Viaarxiv icon

ViP-Mixer: A Convolutional Mixer for Video Prediction

Nov 20, 2023
Xin Zheng, Ziang Peng, Yuan Cao, Hongming Shan, Junping Zhang

Video prediction aims to predict future frames from a video's previous content. Existing methods mainly process video data where the time dimension mingles with the space and channel dimensions from three distinct angles: as a sequence of individual frames, as a 3D volume in spatiotemporal coordinates, or as a stacked image where frames are treated as separate channels. Most of them generally focus on one of these perspectives and may fail to fully exploit the relationships across different dimensions. To address this issue, this paper introduces a convolutional mixer for video prediction, termed ViP-Mixer, to model the spatiotemporal evolution in the latent space of an autoencoder. The ViP-Mixers are stacked sequentially and interleave feature mixing at three levels: frames, channels, and locations. Extensive experiments demonstrate that our proposed method achieves new state-of-the-art prediction performance on three benchmark video datasets covering both synthetic and real-world scenarios.

* Under review 
Viaarxiv icon

LICO: Explainable Models with Language-Image Consistency

Oct 15, 2023
Yiming Lei, Zilong Li, Yangyang Li, Junping Zhang, Hongming Shan

Interpreting the decisions of deep learning models has been actively studied since the explosion of deep neural networks. One of the most convincing interpretation approaches is salience-based visual interpretation, such as Grad-CAM, where the generation of attention maps depends merely on categorical labels. Although existing interpretation methods can provide explainable decision clues, they often yield partial correspondence between image and saliency maps due to the limited discriminative information from one-hot labels. This paper develops a Language-Image COnsistency model for explainable image classification, termed LICO, by correlating learnable linguistic prompts with corresponding visual features in a coarse-to-fine manner. Specifically, we first establish a coarse global manifold structure alignment by minimizing the distance between the distributions of image and language features. We then achieve fine-grained saliency maps by applying optimal transport (OT) theory to assign local feature maps with class-specific prompts. Extensive experimental results on eight benchmark datasets demonstrate that the proposed LICO achieves a significant improvement in generating more explainable attention maps in conjunction with existing interpretation methods such as Grad-CAM. Remarkably, LICO improves the classification performance of existing models without introducing any computational overhead during inference. Source code is made available at https://github.com/ymLeiFDU/LICO.

* Accepted by NeurIPS 2023 
Viaarxiv icon

Semantic Latent Decomposition with Normalizing Flows for Face Editing

Sep 11, 2023
Binglei Li, Zhizhong Huang, Hongming Shan, Junping Zhang

Figure 1 for Semantic Latent Decomposition with Normalizing Flows for Face Editing
Figure 2 for Semantic Latent Decomposition with Normalizing Flows for Face Editing
Figure 3 for Semantic Latent Decomposition with Normalizing Flows for Face Editing
Figure 4 for Semantic Latent Decomposition with Normalizing Flows for Face Editing

Navigating in the latent space of StyleGAN has shown effectiveness for face editing. However, the resulting methods usually encounter challenges in complicated navigation due to the entanglement among different attributes in the latent space. To address this issue, this paper proposes a novel framework, termed SDFlow, with a semantic decomposition in original latent space using continuous conditional normalizing flows. Specifically, SDFlow decomposes the original latent code into different irrelevant variables by jointly optimizing two components: (i) a semantic encoder to estimate semantic variables from input faces and (ii) a flow-based transformation module to map the latent code into a semantic-irrelevant variable in Gaussian distribution, conditioned on the learned semantic variables. To eliminate the entanglement between variables, we employ a disentangled learning strategy under a mutual information framework, thereby providing precise manipulation controls. Experimental results demonstrate that SDFlow outperforms existing state-of-the-art face editing methods both qualitatively and quantitatively. The source code is made available at https://github.com/phil329/SDFlow.

Viaarxiv icon

Learning to Distill Global Representation for Sparse-View CT

Aug 19, 2023
Zilong Li, Chenglong Ma, Jie Chen, Junping Zhang, Hongming Shan

Figure 1 for Learning to Distill Global Representation for Sparse-View CT
Figure 2 for Learning to Distill Global Representation for Sparse-View CT
Figure 3 for Learning to Distill Global Representation for Sparse-View CT
Figure 4 for Learning to Distill Global Representation for Sparse-View CT

Sparse-view computed tomography (CT) -- using a small number of projections for tomographic reconstruction -- enables much lower radiation dose to patients and accelerated data acquisition. The reconstructed images, however, suffer from strong artifacts, greatly limiting their diagnostic value. Current trends for sparse-view CT turn to the raw data for better information recovery. The resultant dual-domain methods, nonetheless, suffer from secondary artifacts, especially in ultra-sparse view scenarios, and their generalization to other scanners/protocols is greatly limited. A crucial question arises: have the image post-processing methods reached the limit? Our answer is not yet. In this paper, we stick to image post-processing methods due to great flexibility and propose global representation (GloRe) distillation framework for sparse-view CT, termed GloReDi. First, we propose to learn GloRe with Fourier convolution, so each element in GloRe has an image-wide receptive field. Second, unlike methods that only use the full-view images for supervision, we propose to distill GloRe from intermediate-view reconstructed images that are readily available but not explored in previous literature. The success of GloRe distillation is attributed to two key components: representation directional distillation to align the GloRe directions, and band-pass-specific contrastive distillation to gain clinically important details. Extensive experiments demonstrate the superiority of the proposed GloReDi over the state-of-the-art methods, including dual-domain ones. The source code is available at https://github.com/longzilicart/GloReDi.

* ICCV 2023 
Viaarxiv icon

Online Prototype Learning for Online Continual Learning

Aug 01, 2023
Yujie Wei, Jiaxin Ye, Zhizhong Huang, Junping Zhang, Hongming Shan

Figure 1 for Online Prototype Learning for Online Continual Learning
Figure 2 for Online Prototype Learning for Online Continual Learning
Figure 3 for Online Prototype Learning for Online Continual Learning
Figure 4 for Online Prototype Learning for Online Continual Learning

Online continual learning (CL) studies the problem of learning continuously from a single-pass data stream while adapting to new data and mitigating catastrophic forgetting. Recently, by storing a small subset of old data, replay-based methods have shown promising performance. Unlike previous methods that focus on sample storage or knowledge distillation against catastrophic forgetting, this paper aims to understand why the online learning models fail to generalize well from a new perspective of shortcut learning. We identify shortcut learning as the key limiting factor for online CL, where the learned features may be biased, not generalizable to new tasks, and may have an adverse impact on knowledge distillation. To tackle this issue, we present the online prototype learning (OnPro) framework for online CL. First, we propose online prototype equilibrium to learn representative features against shortcut learning and discriminative features to avoid class confusion, ultimately achieving an equilibrium status that separates all seen classes well while learning new classes. Second, with the feedback of online prototypes, we devise a novel adaptive prototypical feedback mechanism to sense the classes that are easily misclassified and then enhance their boundaries. Extensive experimental results on widely-used benchmark datasets demonstrate the superior performance of OnPro over the state-of-the-art baseline methods. Source code is available at https://github.com/weilllllls/OnPro.

* ICCV 2023 
Viaarxiv icon

Adaptive Nonlinear Latent Transformation for Conditional Face Editing

Jul 15, 2023
Zhizhong Huang, Siteng Ma, Junping Zhang, Hongming Shan

Figure 1 for Adaptive Nonlinear Latent Transformation for Conditional Face Editing
Figure 2 for Adaptive Nonlinear Latent Transformation for Conditional Face Editing
Figure 3 for Adaptive Nonlinear Latent Transformation for Conditional Face Editing
Figure 4 for Adaptive Nonlinear Latent Transformation for Conditional Face Editing

Recent works for face editing usually manipulate the latent space of StyleGAN via the linear semantic directions. However, they usually suffer from the entanglement of facial attributes, need to tune the optimal editing strength, and are limited to binary attributes with strong supervision signals. This paper proposes a novel adaptive nonlinear latent transformation for disentangled and conditional face editing, termed AdaTrans. Specifically, our AdaTrans divides the manipulation process into several finer steps; i.e., the direction and size at each step are conditioned on both the facial attributes and the latent codes. In this way, AdaTrans describes an adaptive nonlinear transformation trajectory to manipulate the faces into target attributes while keeping other attributes unchanged. Then, AdaTrans leverages a predefined density model to constrain the learned trajectory in the distribution of latent codes by maximizing the likelihood of transformed latent code. Moreover, we also propose a disentangled learning strategy under a mutual information framework to eliminate the entanglement among attributes, which can further relax the need for labeled data. Consequently, AdaTrans enables a controllable face editing with the advantages of disentanglement, flexibility with non-binary attributes, and high fidelity. Extensive experimental results on various facial attributes demonstrate the qualitative and quantitative effectiveness of the proposed AdaTrans over existing state-of-the-art methods, especially in the most challenging scenarios with a large age gap and few labeled examples. The source code is available at https://github.com/Hzzone/AdaTrans.

* ICCV 2023 
Viaarxiv icon

FreeSeed: Frequency-band-aware and Self-guided Network for Sparse-view CT Reconstruction

Jul 12, 2023
Chenglong Ma, Zilong Li, Junping Zhang, Yi Zhang, Hongming Shan

Figure 1 for FreeSeed: Frequency-band-aware and Self-guided Network for Sparse-view CT Reconstruction
Figure 2 for FreeSeed: Frequency-band-aware and Self-guided Network for Sparse-view CT Reconstruction
Figure 3 for FreeSeed: Frequency-band-aware and Self-guided Network for Sparse-view CT Reconstruction
Figure 4 for FreeSeed: Frequency-band-aware and Self-guided Network for Sparse-view CT Reconstruction

Sparse-view computed tomography (CT) is a promising solution for expediting the scanning process and mitigating radiation exposure to patients, the reconstructed images, however, contain severe streak artifacts, compromising subsequent screening and diagnosis. Recently, deep learning-based image post-processing methods along with their dual-domain counterparts have shown promising results. However, existing methods usually produce over-smoothed images with loss of details due to (1) the difficulty in accurately modeling the artifact patterns in the image domain, and (2) the equal treatment of each pixel in the loss function. To address these issues, we concentrate on the image post-processing and propose a simple yet effective FREquency-band-awarE and SElf-guidED network, termed FreeSeed, which can effectively remove artifact and recover missing detail from the contaminated sparse-view CT images. Specifically, we first propose a frequency-band-aware artifact modeling network (FreeNet), which learns artifact-related frequency-band attention in Fourier domain for better modeling the globally distributed streak artifact on the sparse-view CT images. We then introduce a self-guided artifact refinement network (SeedNet), which leverages the predicted artifact to assist FreeNet in continuing to refine the severely corrupted details. Extensive experiments demonstrate the superior performance of FreeSeed and its dual-domain counterpart over the state-of-the-art sparse-view CT reconstruction methods. Source code is made available at https://github.com/Masaaki-75/freeseed.

* MICCAI 2023 
Viaarxiv icon

CLIP-Lung: Textual Knowledge-Guided Lung Nodule Malignancy Prediction

Apr 17, 2023
Yiming Lei, Zilong Li, Yan Shen, Junping Zhang, Hongming Shan

Figure 1 for CLIP-Lung: Textual Knowledge-Guided Lung Nodule Malignancy Prediction
Figure 2 for CLIP-Lung: Textual Knowledge-Guided Lung Nodule Malignancy Prediction
Figure 3 for CLIP-Lung: Textual Knowledge-Guided Lung Nodule Malignancy Prediction
Figure 4 for CLIP-Lung: Textual Knowledge-Guided Lung Nodule Malignancy Prediction

Lung nodule malignancy prediction has been enhanced by advanced deep-learning techniques and effective tricks. Nevertheless, current methods are mainly trained with cross-entropy loss using one-hot categorical labels, which results in difficulty in distinguishing those nodules with closer progression labels. Interestingly, we observe that clinical text information annotated by radiologists provides us with discriminative knowledge to identify challenging samples. Drawing on the capability of the contrastive language-image pre-training (CLIP) model to learn generalized visual representations from text annotations, in this paper, we propose CLIP-Lung, a textual knowledge-guided framework for lung nodule malignancy prediction. First, CLIP-Lung introduces both class and attribute annotations into the training of the lung nodule classifier without any additional overheads in inference. Second, we designed a channel-wise conditional prompt (CCP) module to establish consistent relationships between learnable context prompts and specific feature maps. Third, we align image features with both class and attribute features via contrastive learning, rectifying false positives and false negatives in latent space. The experimental results on the benchmark LIDC-IDRI dataset have demonstrated the superiority of CLIP-Lung, both in classification performance and interpretability of attention maps.

Viaarxiv icon