Zheng Yang

Smooth Video Synthesis with Noise Constraints on Diffusion Models for One-shot Video Tuning

Nov 29, 2023
Liang Peng, Haoran Cheng, Zheng Yang, Ruisi Zhao, Linxuan Xia, Chaotian Song, Qinglin Lu, Wei Liu, Boxi Wu

Recent one-shot video tuning methods, which fine-tune the network on a specific video based on pre-trained text-to-image models (e.g., Stable Diffusion), are popular in the community because of their flexibility. However, these methods often produce videos marred by incoherence and inconsistency. To address these limitations, this paper introduces a simple yet effective noise constraint across video frames. This constraint regulates noise predictions across temporal neighbors, resulting in smooth latents, and can be included as a simple loss term during training. By applying the loss to existing one-shot video tuning methods, we significantly improve the overall consistency and smoothness of the generated videos. Furthermore, we argue that current video evaluation metrics inadequately capture smoothness. To address this, we introduce a novel metric that considers detailed features and their temporal dynamics. Experimental results validate the effectiveness of our approach in producing smoother videos on various one-shot video tuning baselines. The source code and video demos are available at https://github.com/SPengLiang/SmoothVideo.
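The exact form of the constraint is given in the paper; as a rough numpy sketch, assuming it penalizes differences between noise predictions of temporally adjacent frames (the function name is illustrative, not the authors' API):

```python
import numpy as np

def noise_smoothness_loss(noise_pred):
    """Mean squared difference between noise predictions of adjacent
    frames; identical predictions across all frames give zero loss.

    noise_pred: array of shape (T, ...) with one noise map per frame.
    """
    diffs = noise_pred[1:] - noise_pred[:-1]  # temporal neighbor differences
    return float(np.mean(diffs ** 2))
```

Added to the usual diffusion training objective with a small weight, such a term pushes the predicted noise, and hence the denoised latents, to vary smoothly over time.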

Pair-wise Layer Attention with Spatial Masking for Video Prediction

Nov 19, 2023
Ping Li, Chenhan Zhang, Zheng Yang, Xianghua Xu, Mingli Song

Video prediction yields future frames from historical frames and has shown great potential in many applications, e.g., meteorological prediction and autonomous driving. Previous works often decode only the ultimate high-level semantic features into future frames, losing texture details and degrading prediction quality. Motivated by this, we develop a Pair-wise Layer Attention (PLA) module that enhances the layer-wise semantic dependency of the feature maps derived from the U-shape structure in the Translator by coupling low-level visual cues with high-level features, thereby enriching the texture details of predicted frames. Moreover, most existing methods capture spatiotemporal dynamics with the Translator but fail to sufficiently utilize the spatial features of the Encoder. This inspires us to design a Spatial Masking (SM) module that masks part of the encoded features during pretraining, forcing the Decoder to exploit the remaining visible feature pixels. Combining these, we present the Pair-wise Layer Attention with Spatial Masking (PLA-SM) framework for video prediction, which captures the spatiotemporal dynamics that reflect the motion trend. Extensive experiments and rigorous ablation studies on five benchmarks demonstrate the advantages of the proposed approach. The code is available at GitHub.
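The SM module's exact masking scheme is described in the paper; a generic numpy sketch of random spatial masking of an encoder feature map (the names and the Bernoulli keep/drop scheme are illustrative):

```python
import numpy as np

def spatial_mask(features, mask_ratio=0.5, seed=None):
    """Zero out a random subset of spatial positions of a (C, H, W)
    feature map, shared across channels, and return the keep-mask."""
    rng = np.random.default_rng(seed)
    _, h, w = features.shape
    keep = rng.random((h, w)) >= mask_ratio  # True where features stay visible
    return features * keep, keep
```

During pretraining the Decoder then has to reconstruct the target from only the visible positions, which encourages stronger use of the Encoder's spatial features.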

Few-shot Hybrid Domain Adaptation of Image Generators

Oct 30, 2023
Hengjia Li, Yang Liu, Linxuan Xia, Yuqi Lin, Tu Zheng, Zheng Yang, Wenxiao Wang, Xiaohui Zhong, Xiaobo Ren, Xiaofei He

Can a pre-trained generator be adapted to the hybrid of multiple target domains and generate images with their integrated attributes? In this work, we introduce a new task -- Few-shot Hybrid Domain Adaptation (HDA). Given a source generator and several target domains, HDA aims to acquire an adapted generator that preserves the integrated attributes of all target domains, without overriding the source domain's characteristics. Compared with Domain Adaptation (DA), HDA offers greater flexibility and versatility, adapting generators to more composite and expansive domains. At the same time, HDA presents more challenges than DA, as we have access only to images from individual target domains and lack authentic images from the hybrid domain. To address this issue, we introduce a discriminator-free framework that directly encodes different domains' images into well-separable subspaces. To achieve HDA, we propose a novel directional subspace loss comprising a distance loss and a direction loss. Concretely, the distance loss blends the attributes of all target domains by reducing the distances from generated images to all target subspaces. The direction loss preserves the characteristics of the source domain by guiding the adaptation along directions perpendicular to the subspaces. Experiments show that our method can obtain numerous domain-specific attributes in a single adapted generator, surpassing the baseline methods in semantic similarity, image fidelity, and cross-domain consistency.
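The loss is defined precisely in the paper; one plausible numpy sketch, representing each target subspace by an orthonormal basis (all names and the exact weighting are illustrative assumptions):

```python
import numpy as np

def off_subspace(x, basis):
    """Component of x orthogonal to span(basis); basis columns orthonormal."""
    return x - basis @ (basis.T @ x)

def directional_subspace_loss(gen, src, bases, w_dir=1.0):
    """Toy distance loss plus direction loss.

    gen, src: embeddings of a generated and a source image.
    bases:    list of orthonormal bases, one per target subspace.
    """
    # Distance loss: pull the generated embedding toward every target subspace.
    dist = sum(np.linalg.norm(off_subspace(gen, b)) for b in bases)
    # Direction loss: keep the adaptation step perpendicular to the
    # subspaces by penalizing its in-subspace component.
    step = gen - src
    direc = sum(np.linalg.norm(b.T @ step) for b in bases)
    return dist + w_dir * direc
```

The two terms trade off attribute acquisition (move into the target subspaces) against source preservation (move only along perpendicular directions).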

NormKD: Normalized Logits for Knowledge Distillation

Aug 01, 2023
Zhihao Chi, Tu Zheng, Hengjia Li, Zheng Yang, Boxi Wu, Binbin Lin, Deng Cai

Logit-based knowledge distillation has received less attention in recent years, since feature-based methods perform better in most cases. Nevertheless, we find it still has untapped potential when we re-investigate the temperature, a crucial hyper-parameter used to soften the logit outputs. In most previous works, the temperature was fixed for the entire distillation procedure. However, since the logits of different samples are distributed quite differently, a single temperature cannot soften all of them to an appropriate degree, which may cause previous methods to transfer the knowledge of each sample inadequately. In this paper, we restudy the temperature hyper-parameter and show that a single value cannot distill the knowledge of every sample sufficiently. To address this issue, we propose Normalized Knowledge Distillation (NormKD), which customizes the temperature for each sample according to the characteristics of that sample's logit distribution. Compared to vanilla KD, NormKD incurs almost no extra computation or storage cost but performs significantly better on CIFAR-100 and ImageNet for image classification. Furthermore, NormKD can be easily applied to other logit-based methods and achieves performance close to, or even better than, feature-based methods.
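The paper derives its per-sample temperature from each sample's logit distribution; a numpy sketch using the per-sample standard deviation as the temperature (the `alpha` scale and the `norm_softened` name are illustrative assumptions):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def norm_softened(logits, alpha=1.0, eps=1e-8):
    """Soften each sample's logits with its own temperature, taken here
    as the per-sample standard deviation of the logits times alpha."""
    t = logits.std(axis=-1, keepdims=True) * alpha + eps
    return softmax(logits / t)
```

A handy consequence of this normalization: rescaling a sample's logits by a constant no longer changes how strongly they are softened.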

Learning Occupancy for Monocular 3D Object Detection

May 25, 2023
Liang Peng, Junkai Xu, Haoran Cheng, Zheng Yang, Xiaopei Wu, Wei Qian, Wenxiao Wang, Boxi Wu, Deng Cai

Monocular 3D detection is a challenging task due to the lack of accurate 3D information. Existing approaches typically rely on geometry constraints and dense depth estimates to facilitate the learning, but often fail to fully exploit the benefits of three-dimensional feature extraction in frustum and 3D space. In this paper, we propose OccupancyM3D, a method that learns occupancy for monocular 3D detection. It directly learns occupancy in frustum and 3D space, leading to more discriminative and informative 3D features and representations. Specifically, using synchronized raw sparse LiDAR point clouds, we define the space status and generate voxel-based occupancy labels. We formulate occupancy prediction as a simple classification problem and design associated occupancy losses. The resulting occupancy estimates are employed to enhance the original frustum/3D features. Experiments on the KITTI and Waymo Open datasets demonstrate that the proposed method achieves a new state of the art and surpasses other methods by a significant margin. Codes and pre-trained models will be available at https://github.com/SPengLiang/OccupancyM3D.
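The paper's full label generation also reasons about space status along LiDAR rays; a minimal numpy sketch that only marks point-containing voxels as occupied (function and parameter names are illustrative):

```python
import numpy as np

def occupancy_labels(points, grid_min, voxel_size, grid_shape):
    """Voxel grid with 1 where at least one LiDAR point falls, else 0.

    points:   (N, 3) array of x, y, z coordinates.
    grid_min: (3,) minimum corner of the voxel grid.
    """
    occ = np.zeros(grid_shape, dtype=np.int64)
    idx = np.floor((points - grid_min) / voxel_size).astype(int)
    inside = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
    idx = idx[inside]  # drop points that fall outside the grid
    occ[idx[:, 0], idx[:, 1], idx[:, 2]] = 1
    return occ
```

With labels of this kind, occupancy prediction becomes a per-voxel classification problem, and the predicted occupancy can be fed back to enhance the frustum/3D features.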

APPT : Asymmetric Parallel Point Transformer for 3D Point Cloud Understanding

Mar 31, 2023
Hengjia Li, Tu Zheng, Zhihao Chi, Zheng Yang, Wenxiao Wang, Boxi Wu, Binbin Lin, Deng Cai

Transformer-based networks have achieved impressive performance in 3D point cloud understanding. However, most of them concentrate on aggregating local features but neglect to directly model global dependencies, which results in a limited effective receptive field. In addition, how to effectively incorporate local and global components remains challenging. To tackle these problems, we propose the Asymmetric Parallel Point Transformer (APPT). Specifically, we introduce Global Pivot Attention to extract global features and enlarge the effective receptive field. Moreover, we design an Asymmetric Parallel structure to effectively integrate local and global information. With these designs, APPT captures features globally throughout the entire network while focusing on local details. Extensive experiments show that our method outperforms prior approaches and achieves state-of-the-art results on several benchmarks for 3D point cloud understanding, such as 3D semantic segmentation on S3DIS, 3D shape classification on ModelNet40, and 3D part segmentation on ShapeNet.
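The exact formulation of Global Pivot Attention is in the paper; a generic numpy sketch in which every point attends to a small strided set of "pivot" points, giving each point a global receptive field at reduced cost (the pivot selection and all names are illustrative assumptions):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def pivot_attention(x, stride=4):
    """x: (N, D) point features. Each point attends to x[::stride]."""
    pivots = x[::stride]                          # (P, D) global pivots
    scores = x @ pivots.T / np.sqrt(x.shape[1])   # (N, P) attention logits
    return softmax(scores) @ pivots               # (N, D) globally mixed features
```

Because attention runs against P pivots instead of all N points, the cost drops from O(N^2) to O(N * P) while still mixing information across the whole cloud.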

Improving Multi-generation Robustness of Learned Image Compression

Oct 31, 2022
Litian Li, Zheng Yang, Ronggang Wang

Benefiting from flexible network designs and an end-to-end joint optimization approach, learned image compression (LIC) has demonstrated excellent coding performance and practical feasibility in recent years. However, existing compression models suffer from serious multi-generation loss, which commonly occurs during image editing and transcoding: as an image is repeatedly encoded and decoded, its quality rapidly degrades with various types of distortion, significantly limiting the practical application of LIC. In this paper, a thorough analysis is carried out to determine the source of generation loss in successive image compression (SIC). We point out and solve the quantization drift problem that affects SIC, and propose a reversibility loss function as well as a channel relaxation method to further reduce the generation loss. Experiments show that with our proposed solutions, LIC can achieve performance comparable to the first compression of BPG even after 50 rounds of re-encoding, without any change to the network structure.
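As a toy illustration of quantization drift (not the paper's codec): when the synthesis transform is not an exact inverse of the analysis transform, each re-encoding cycle moves the reconstruction slightly, and the error accumulates over generations. The 0.98 gain below is purely illustrative:

```python
import numpy as np

def reencode(v, synthesis_gain=0.98):
    """One generation of a toy codec: quantize the latent, then decode
    with a synthesis step that slightly amplifies the result."""
    return np.round(v) / synthesis_gain

v0 = 30.0
v = v0
for _ in range(10):
    v = reencode(v)
# v has now drifted far from v0, even though a single generation
# only incurs a sub-unit rounding error.
```

Intuitively, curing drift means making the decode-then-re-encode cycle (approximately) idempotent, so latents snap back to the same quantization points instead of walking away generation after generation.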

DID-M3D: Decoupling Instance Depth for Monocular 3D Object Detection

Jul 22, 2022
Liang Peng, Xiaopei Wu, Zheng Yang, Haifeng Liu, Deng Cai

Monocular 3D detection has drawn much attention from the community due to its low cost and simple setup. It takes an RGB image as input and predicts 3D boxes in 3D space. The most challenging sub-task is instance depth estimation. Previous works usually estimate it directly. However, in this paper we point out that instance depth on an RGB image is non-intuitive: it couples visual depth cues with instance attribute cues, making it hard to learn directly in the network. Therefore, we propose to reformulate the instance depth as the combination of the instance visual surface depth (visual depth) and the instance attribute depth (attribute depth). The visual depth is related to objects' appearances and positions on the image. By contrast, the attribute depth relies on objects' inherent attributes, which are invariant to affine transformations of the object on the image. Correspondingly, we decouple the 3D localization uncertainty into visual depth uncertainty and attribute depth uncertainty. By combining the different types of depths and their associated uncertainties, we obtain the final instance depth. Furthermore, data augmentation in monocular 3D detection is usually limited by the physical nature of the task, hindering performance gains; the proposed instance depth disentanglement strategy alleviates this problem. Evaluated on KITTI, our method achieves new state-of-the-art results, and extensive ablation studies validate the effectiveness of each component. The codes are released at https://github.com/SPengLiang/DID-M3D.
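The paper's exact aggregation may differ; a numpy sketch of one common pattern, where the decoupled depths add per candidate and candidates are then fused with uncertainty-derived weights (the names and the softmax-style weighting are illustrative assumptions):

```python
import numpy as np

def combine_depths(d_vis, d_att, u_vis, u_att):
    """Fuse per-location visual/attribute depth candidates.

    Returns the aggregated instance depth and the per-location
    combined uncertainties.
    """
    d = d_vis + d_att                 # decoupled depths recombine additively
    u = np.sqrt(u_vis**2 + u_att**2)  # independent-uncertainty combination
    w = np.exp(-u)                    # lower uncertainty -> higher weight
    w = w / w.sum()
    return float((w * d).sum()), u
```

With equal uncertainties this reduces to a plain average; otherwise more certain candidates dominate the final instance depth.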

* ECCV 2022 

Dual Power Spectrum Manifold and Toeplitz HPD Manifold: Enhancement and Analysis for Matrix CFAR Detection

Jun 24, 2022
Hao Wu, Yongqiang Cheng, Xixi Chen, Zheng Yang, Xiang Li, Hongqiang Wang

Recently, an innovative matrix CFAR detection scheme based on information geometry, also referred to as the geometric detector, has developed rapidly and exhibits distinct advantages in several practical applications. These advantages stem from the geometry of the Toeplitz Hermitian positive definite (HPD) manifold $\mathcal{M}_{\mathcal{T}H_{++}}$, but the sophisticated geometry also poses challenges for geometric detectors, such as implementing an enhanced detector to improve the SCR (signal-to-clutter ratio) and analyzing the detection performance. To meet these challenges, this paper develops the dual power spectrum manifold $\mathcal{M}_{\text{P}}$ as the dual space of $\mathcal{M}_{\mathcal{T}H_{++}}$. For each affine invariant geometric measure on $\mathcal{M}_{\mathcal{T}H_{++}}$, we show that there exists an equivalent function, named the induced potential function, on $\mathcal{M}_{\text{P}}$. Through the induced potential function, the dissimilarity between two matrices can be measured on $\mathcal{M}_{\text{P}}$, and the geometric detectors can be reformulated in terms of the power spectrum. The induced potential function leads to two contributions: 1) The enhancement of the geometric detector, formulated as an optimization problem over $\mathcal{M}_{\mathcal{T}H_{++}}$, is transformed into an equivalent and simpler optimization on $\mathcal{M}_{\text{P}}$. In the presented example of the enhancement, a closed-form solution, instead of a gradient descent method, is obtained through the equivalent optimization. 2) The detection performance is analyzed based on $\mathcal{M}_{\text{P}}$, and the advantageous characteristics that benefit detection performance can be deduced by analyzing the power spectrum corresponding to the maximal point of the induced potential function.
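The paper works with affine invariant measures on $\mathcal{M}_{\mathcal{T}H_{++}}$ and their induced potential functions; as a much simpler generic skeleton of a matrix CFAR (geometric) detector, here with a log-Euclidean dissimilarity (the measure and all names are illustrative, not the paper's formulation):

```python
import numpy as np

def logm_spd(a):
    """Matrix logarithm of a symmetric positive definite matrix."""
    w, v = np.linalg.eigh(a)
    return (v * np.log(w)) @ v.T

def geometric_detector(test_cov, secondary_covs, threshold):
    """Declare a target when the dissimilarity between the cell under
    test and the mean of the secondary (clutter) covariance matrices
    exceeds a threshold."""
    mean_log = sum(logm_spd(s) for s in secondary_covs) / len(secondary_covs)
    d = np.linalg.norm(logm_spd(test_cov) - mean_log)
    return d > threshold, d
```

The paper's point is that for affine invariant measures this kind of matrix-domain computation has an equivalent, and often simpler, formulation on the dual power spectrum manifold.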

* Submitted to IEEE Transactions on Information Theory 