Chengjiang Long

Learning Dynamic Style Kernels for Artistic Style Transfer

Apr 14, 2023
Wenju Xu, Chengjiang Long, Yongwei Nie

Arbitrary style transfer has been demonstrated to be efficient in artistic image generation. Previous methods either globally modulate the content feature while ignoring local details, or overly focus on local structure details, leading to style leakage. In contrast to the literature, we propose a new "style kernel" scheme that learns spatially adaptive kernels for per-pixel stylization: the convolutional kernels are dynamically generated from the globally style-content aligned feature, and the learned kernels are then applied to modulate the content feature at each spatial position. This new scheme allows flexible global and local interactions between the content and style features, so that the desired styles can be transferred to the content image while the content structure is preserved. To further enhance the flexibility of our style transfer method, we propose a Style Alignment Encoding (SAE) module, complemented with a Content-based Gating Modulation (CGM) module, for learning the dynamic style kernels in the focused regions. Extensive experiments demonstrate that our proposed method outperforms state-of-the-art methods and exhibits superior performance in terms of visual quality and efficiency.
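
A minimal PyTorch sketch of the per-pixel dynamic-kernel idea described in the abstract above. The module name, tensor shapes, kernel-prediction head, and softmax normalization are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicStyleKernel(nn.Module):
    """Illustrative per-pixel dynamic kernel modulation (not the paper's code)."""
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.k = kernel_size
        # Predict one k*k kernel per channel and spatial position
        # from the style-content aligned feature.
        self.kernel_head = nn.Conv2d(channels, channels * kernel_size * kernel_size, 1)

    def forward(self, content_feat, aligned_feat):
        b, c, h, w = content_feat.shape
        kernels = self.kernel_head(aligned_feat)                        # (B, C*k*k, H, W)
        kernels = kernels.view(b, c, self.k * self.k, h * w)
        patches = F.unfold(content_feat, self.k, padding=self.k // 2)   # (B, C*k*k, H*W)
        patches = patches.view(b, c, self.k * self.k, h * w)
        out = (kernels.softmax(dim=2) * patches).sum(dim=2)             # per-pixel convolution
        return out.view(b, c, h, w)

# Usage with random features standing in for encoder outputs.
content = torch.randn(1, 64, 32, 32)
aligned = torch.randn(1, 64, 32, 32)
stylized = DynamicStyleKernel(64)(content, aligned)
print(stylized.shape)  # torch.Size([1, 64, 32, 32])
```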

Feature Representation Learning with Adaptive Displacement Generation and Transformer Fusion for Micro-Expression Recognition

Apr 10, 2023
Zhijun Zhai, Jianhui Zhao, Chengjiang Long, Wenju Xu, Shuangjiang He, Huijuan Zhao

Micro-expressions are spontaneous, rapid and subtle facial movements that can neither be forged nor suppressed. They are important nonverbal communication cues, but are transient and of low intensity and thus difficult to recognize. Deep learning based methods have recently been developed for micro-expression (ME) recognition using feature extraction and fusion techniques; however, targeted feature learning and efficient feature fusion tailored to ME characteristics still need further study. To address these issues, we propose Feature Representation Learning with adaptive Displacement Generation and Transformer fusion (FRL-DGT), a novel framework in which a convolutional Displacement Generation Module (DGM) trained with self-supervision extracts dynamic features from onset/apex frames targeted to the subsequent ME recognition task, and a well-designed Transformer fusion mechanism composed of three Transformer-based fusion modules (local and global fusion based on AU regions, and full-face fusion) extracts multi-level informative features after the DGM for the final ME prediction. Extensive experiments with a solid leave-one-subject-out (LOSO) evaluation demonstrate the superiority of our proposed FRL-DGT over state-of-the-art methods.
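
A toy PyTorch sketch of the two ingredients the abstract names: a displacement generator over onset/apex frames and a Transformer-based fusion over region tokens. All layer sizes, the grayscale inputs, the patch embedding, and the three-class head are assumptions for illustration, not the FRL-DGT architecture.

```python
import torch
import torch.nn as nn

class DisplacementGenerator(nn.Module):
    """Toy stand-in for the DGM: predicts a 2-channel displacement field
    from concatenated grayscale onset/apex frames (shapes are assumptions)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 2, 3, padding=1),  # (dx, dy) per pixel
        )

    def forward(self, onset, apex):
        return self.net(torch.cat([onset, apex], dim=1))

class TransformerFusion(nn.Module):
    """Toy fusion of region tokens with a standard TransformerEncoder."""
    def __init__(self, dim=64, heads=4, layers=2, num_classes=3):
        super().__init__()
        enc = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, layers)
        self.cls = nn.Linear(dim, num_classes)  # illustrative class count

    def forward(self, tokens):              # tokens: (B, N, dim)
        fused = self.encoder(tokens)
        return self.cls(fused.mean(dim=1))  # pool and classify

onset, apex = torch.randn(2, 1, 64, 64), torch.randn(2, 1, 64, 64)
disp = DisplacementGenerator()(onset, apex)                 # (2, 2, 64, 64)
patch_embed = nn.Conv2d(2, 64, kernel_size=8, stride=8)     # 8x8 patch tokens
tokens = patch_embed(disp).flatten(2).transpose(1, 2)        # (2, 64, 64)
logits = TransformerFusion()(tokens)
print(logits.shape)  # torch.Size([2, 3])
```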

Continuous Intermediate Token Learning with Implicit Motion Manifold for Keyframe Based Motion Interpolation

Mar 27, 2023
Clinton Ansun Mo, Kun Hu, Chengjiang Long, Zhiyong Wang

Deriving sophisticated 3D motions from sparse keyframes is a particularly challenging problem, as it demands both motion continuity and exceptional skeletal precision. Action features can often be derived accurately from the full series of keyframes, so leveraging the global context with transformers has been a promising data-driven embedding approach. However, existing methods typically take as input intermediate frames produced by basic interpolation between keyframes to enforce continuity, which results in a trivial local minimum during training. In this paper, we propose a novel framework that formulates latent motion manifolds with keyframe-based constraints, from which the continuous nature of intermediate token representations is considered. In particular, our proposed framework consists of two stages for identifying a latent motion subspace, i.e., a keyframe encoding stage and an intermediate token generation stage, and a subsequent motion synthesis stage to extrapolate and compose motion data from the manifolds. Through extensive experiments conducted on both the LaFAN1 and CMU Mocap datasets, our proposed method demonstrates both superior interpolation accuracy and high visual similarity to ground truth motions.

* Accepted by CVPR 2023 

Explore Contextual Information for 3D Scene Graph Generation

Oct 12, 2022
Yuanyuan Liu, Chengjiang Long, Zhaoxuan Zhang, Bokai Liu, Qiang Zhang, Baocai Yin, Xin Yang

3D scene graph generation (SGG) has been of high interest in computer vision. Although the accuracy of 3D SGG on coarse classification and single relation labels has gradually improved, the performance of existing works is still far from perfect for fine-grained and multi-label situations. In this paper, we propose a framework that fully explores contextual information for the 3D SGG task, attempting to satisfy the requirements of fine-grained entity classes, multiple relation labels, and high accuracy simultaneously. Our proposed approach is composed of a Graph Feature Extraction module and a Graph Contextual Reasoning module, achieving feature extraction with appropriate information redundancy, structured organization, and hierarchical inference. Our approach achieves superior or competitive performance over previous methods on the 3DSSG dataset, especially on the relationship prediction sub-task.

Diverse Human Motion Prediction via Gumbel-Softmax Sampling from an Auxiliary Space

Jul 15, 2022
Lingwei Dang, Yongwei Nie, Chengjiang Long, Qing Zhang, Guiqing Li

Diverse human motion prediction aims at predicting multiple possible future pose sequences from a sequence of observed poses. Previous approaches usually employ deep generative networks to model the conditional distribution of the data and then randomly sample outcomes from that distribution. While different results can be obtained, they are usually the most likely ones, which are not diverse enough. Recent work explicitly learns multiple modes of the conditional distribution via a deterministic network, which, however, can only cover a fixed number of modes within a limited range. In this paper, we propose a novel sampling strategy for drawing very diverse results from an imbalanced multimodal distribution learned by a deep generative model. Our method works by generating an auxiliary space and making random sampling from the auxiliary space equivalent to diverse sampling from the target distribution. We propose a simple yet effective network architecture that implements this novel sampling strategy, incorporating a Gumbel-Softmax coefficient matrix sampling method and an aggressive diversity-promoting hinge loss function. Extensive experiments demonstrate that our method significantly improves both the diversity and accuracy of the sampled results compared with previous state-of-the-art sampling approaches. Code and pre-trained models are available at https://github.com/Droliven/diverse_sampling.

* Paper and Supp of our work accepted by ACM MM 2022 
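
A small PyTorch sketch of the Gumbel-Softmax coefficient-matrix idea: soft one-hot coefficients sampled over a bank of basis latent codes stand in for the auxiliary space, so that random sampling yields distinct latent codes. The bank size, latent dimension, and the way the codes would condition a decoder are assumptions, not the released implementation (see the linked repository for that).

```python
import torch
import torch.nn.functional as F

# Auxiliary space: a small bank of basis latent codes (learned in the real model).
# Mixing the bases with Gumbel-Softmax coefficient rows stands in for
# "random sampling made equivalent to diverse sampling".
num_bases, latent_dim, num_samples = 10, 32, 5          # illustrative sizes
basis = torch.randn(num_bases, latent_dim)               # basis codes (assumption)
logits = torch.randn(num_samples, num_bases)              # scores from a network (assumption)

coeff = F.gumbel_softmax(logits, tau=0.5, hard=False)     # (num_samples, num_bases)
z = coeff @ basis                                          # (num_samples, latent_dim)
# Each row of z would condition a pretrained generative decoder to produce one future.
print(z.shape)  # torch.Size([5, 32])
```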

Video Shadow Detection via Spatio-Temporal Interpolation Consistency Training

Jun 17, 2022
Xiao Lu, Yihong Cao, Sheng Liu, Chengjiang Long, Zipei Chen, Xuanyu Zhou, Yimin Yang, Chunxia Xiao

It is challenging to annotate large-scale datasets for supervised video shadow detection methods. Directly applying a model trained on labeled images to video frames may lead to high generalization error and temporally inconsistent results. In this paper, we address these challenges by proposing a Spatio-Temporal Interpolation Consistency Training (STICT) framework that rationally feeds unlabeled video frames together with labeled images into an image shadow detection network for training. Specifically, we propose Spatial and Temporal ICT, in which we define two new interpolation schemes, i.e., spatial interpolation and temporal interpolation. We then derive the corresponding spatial and temporal interpolation consistency constraints to enhance generalization in the pixel-wise classification task and to encourage temporally consistent predictions, respectively. In addition, we design a Scale-Aware Network for multi-scale shadow knowledge learning in images, and propose a scale-consistency constraint to minimize the discrepancy among predictions at different scales. Our proposed approach is extensively validated on the ViSha dataset and a self-annotated dataset. Experimental results show that, even without video labels, our approach outperforms most state-of-the-art supervised, semi-supervised and unsupervised image/video shadow detection methods, as well as other methods in related tasks. Code and dataset are available at https://github.com/yihong-97/STICT.

* Accepted by CVPR 2022 
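
A minimal sketch of an interpolation-consistency term in PyTorch: the student's prediction on a mixed pair of inputs is pulled toward the mix of the teacher's predictions. The mean-teacher pairing, Beta-sampled mixing weight, and MSE formulation are common ICT choices used here as assumptions; the paper's spatial/temporal variants and loss weighting differ (see the linked repository for the actual code).

```python
import torch
import torch.nn.functional as F

def interpolation_consistency_loss(student, teacher, x_a, x_b, lam):
    """Illustrative ICT term: the student's prediction on a mixed input should
    match the mix of the teacher's predictions (a sketch, not the paper's losses)."""
    with torch.no_grad():
        target = lam * torch.sigmoid(teacher(x_a)) + (1 - lam) * torch.sigmoid(teacher(x_b))
    pred = torch.sigmoid(student(lam * x_a + (1 - lam) * x_b))
    return F.mse_loss(pred, target)

# Toy usage: a single conv layer stands in for the shadow detector (assumption).
net = torch.nn.Conv2d(3, 1, 3, padding=1)
x_a, x_b = torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64)
lam = torch.distributions.Beta(1.0, 1.0).sample().item()
loss = interpolation_consistency_loss(net, net, x_a, x_b, lam)
loss.backward()
```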

Social Interpretable Tree for Pedestrian Trajectory Prediction

May 26, 2022
Liushuai Shi, Le Wang, Chengjiang Long, Sanping Zhou, Fang Zheng, Nanning Zheng, Gang Hua

Understanding multiple socially acceptable future behaviors is an essential task for many vision applications. In this paper, we propose a tree-based method, termed Social Interpretable Tree (SIT), to address this multi-modal prediction task, in which a hand-crafted tree is built from prior information about the observed trajectory to model multiple future trajectories. Specifically, a path in the tree from the root to a leaf represents one possible future trajectory. SIT employs a coarse-to-fine optimization strategy: the tree is first built using high-order velocities to balance its complexity and coverage, and is then optimized greedily to encourage multimodality. Finally, a teacher-forcing refining operation is used to predict the final fine trajectory. Compared with prior methods that leverage implicit latent variables to represent possible future trajectories, a path in the tree can explicitly explain coarse moving behaviors (e.g., go straight and then turn right), and thus provides better interpretability. Despite the hand-crafted tree, experimental results on the ETH-UCY and Stanford Drone datasets demonstrate that our method matches or exceeds the performance of state-of-the-art methods. Interestingly, the experiments show that the raw built tree, without training, already outperforms many prior deep neural network based approaches. Meanwhile, our method offers sufficient flexibility in long-term prediction and under different best-of-K settings.

* Accepted by AAAI 2022 
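
An illustrative NumPy sketch of the root-to-leaf idea: a coarse trajectory tree whose branches apply a few candidate turning angles to the current velocity, so each leaf is one interpretable future. The angle set, depth, and time step are made-up parameters, not the paper's hand-crafted tree or its optimization.

```python
import numpy as np
from itertools import product

def build_trajectory_tree(last_pos, last_vel, turn_angles, depth, dt=0.4):
    """Toy coarse tree: each node branches over a few turning angles applied
    to the current velocity; every root-to-leaf path is one coarse future."""
    paths = []
    for angles in product(turn_angles, repeat=depth):   # one leaf per angle sequence
        pos, vel, path = last_pos.copy(), last_vel.copy(), []
        for a in angles:
            c, s = np.cos(a), np.sin(a)
            vel = np.array([c * vel[0] - s * vel[1], s * vel[0] + c * vel[1]])
            pos = pos + vel * dt
            path.append(pos.copy())
        paths.append(np.stack(path))                    # (depth, 2) coarse trajectory
    return np.stack(paths)                              # (num_leaves, depth, 2)

coarse = build_trajectory_tree(np.zeros(2), np.array([1.0, 0.0]),
                               turn_angles=[-0.4, 0.0, 0.4], depth=3)
print(coarse.shape)  # (27, 3, 2)
```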

Progressively Generating Better Initial Guesses Towards Next Stages for High-Quality Human Motion Prediction

Mar 30, 2022
Tiezheng Ma, Yongwei Nie, Chengjiang Long, Qing Zhang, Guiqing Li

This paper presents a high-quality human motion prediction method that accurately predicts future human poses given observed ones. Our method is based on the observation that a good initial guess of the future poses is very helpful in improving forecasting accuracy. This motivates us to propose a novel two-stage prediction framework, consisting of an init-prediction network that computes only the good initial guess, followed by a formal-prediction network that predicts the target future poses based on that guess. More importantly, we extend this idea and design a multi-stage prediction framework in which each stage predicts the initial guess for the next stage, bringing additional performance gains. To fulfill the prediction task at each stage, we propose a network comprising a Spatial Dense Graph Convolutional Network (S-DGCN) and a Temporal Dense Graph Convolutional Network (T-DGCN). Alternately executing the two networks helps extract spatiotemporal features over the global receptive field of the whole pose sequence. Together, these design choices make our method outperform previous approaches by large margins: 6%-7% on Human3.6M, 5%-10% on CMU-MoCap, and 13%-16% on 3DPW.

* Accepted by CVPR 2022 
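
A schematic PyTorch sketch of the multi-stage idea: each stage refines the previous stage's guess of the full pose sequence, starting from a last-pose-repeated initial guess. The per-frame residual MLP stage is a stand-in assumption for the S-DGCN/T-DGCN blocks, and all shapes are illustrative.

```python
import torch
import torch.nn as nn

class StageRefiner(nn.Module):
    """Stand-in for one prediction stage: refines a pose sequence residually
    (a per-frame MLP here; the real stages are dense graph convolutions)."""
    def __init__(self, dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, dim))

    def forward(self, seq):            # seq: (B, T, dim)
        return seq + self.mlp(seq)     # residual refinement

def multi_stage_predict(observed, stages, future_len):
    # Initial guess: repeat the last observed pose over the future horizon.
    guess = observed[:, -1:, :].repeat(1, future_len, 1)
    seq = torch.cat([observed, guess], dim=1)
    for stage in stages:               # each stage's output is the next stage's guess
        seq = stage(seq)
    return seq[:, -future_len:, :]

observed = torch.randn(4, 10, 66)      # 10 observed frames, 22 joints x 3 (assumption)
stages = nn.ModuleList([StageRefiner(66) for _ in range(3)])
future = multi_stage_predict(observed, stages, future_len=25)
print(future.shape)                    # torch.Size([4, 25, 66])
```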

CPRAL: Collaborative Panoptic-Regional Active Learning for Semantic Segmentation

Jan 11, 2022
Yu Qiao, Jincheng Zhu, Chengjiang Long, Zeyao Zhang, Yuxin Wang, Zhenjun Du, Xin Yang

Acquiring the most representative examples via active learning (AL) can benefit many data-dependent computer vision tasks by minimizing the effort of image-level or pixel-wise annotation. In this paper, we propose a novel Collaborative Panoptic-Regional Active Learning framework (CPRAL) for the semantic segmentation task. Starting from a small batch of images sampled with pixel-wise annotations, we employ panoptic information to initially select unlabeled samples. Considering the class imbalance in segmentation datasets, we introduce a Regional Gaussian Attention module (RGA) to achieve semantics-biased selection: the subset is highlighted by vote entropy and then attended by Gaussian kernels to emphasize the biased regions. We also propose a Contextual Labels Extension (CLE) mechanism to boost regional annotations with contextual attention guidance. With the collaboration of semantics-agnostic panoptic matching and region-biased selection and extension, CPRAL strikes a balance between labeling effort and performance while taking the semantics distribution into account. Extensive experiments on the Cityscapes and BDD10K datasets show that CPRAL outperforms cutting-edge methods with impressive results and a smaller labeling proportion.

* This is not the final version of our paper, and we will upload a final version later 
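
A rough PyTorch sketch of a vote-entropy plus smoothed-attention selection step: committee disagreement is measured per pixel and then blurred so that coherent uncertain regions, rather than isolated pixels, are proposed for annotation. The committee size, class count, the box filter standing in for the Gaussian kernel, and the top-k budget are assumptions, not the CPRAL pipeline.

```python
import torch
import torch.nn.functional as F

def vote_entropy(prob_maps):
    """prob_maps: (M, C, H, W) softmax outputs from M committee predictions.
    Returns an (H, W) map; higher values mean the committee disagrees more."""
    votes = prob_maps.argmax(dim=1)                       # (M, H, W) hard votes
    C = prob_maps.shape[1]
    hist = torch.stack([(votes == c).float().mean(0) for c in range(C)])  # (C, H, W)
    return -(hist * torch.log(hist + 1e-9)).sum(dim=0)

def smoothed_attention(entropy_map, radius=7):
    """Blur the entropy map so coherent high-uncertainty regions score highest
    (a box filter here as a simple stand-in for a Gaussian kernel)."""
    k = 2 * radius + 1
    kernel = torch.ones(1, 1, k, k) / (k * k)
    return F.conv2d(entropy_map[None, None], kernel, padding=radius)[0, 0]

probs = torch.softmax(torch.randn(5, 19, 64, 128), dim=1)    # 5 votes, 19 classes (assumption)
region_scores = smoothed_attention(vote_entropy(probs))
to_annotate = region_scores.flatten().topk(100).indices       # illustrative labeling budget
print(to_annotate.shape)  # torch.Size([100])
```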