
Luoqi Liu


Towards Consistent Video Editing with Text-to-Image Diffusion Models

May 27, 2023
Zicheng Zhang, Bonan Li, Xuecheng Nie, Congying Han, Tiande Guo, Luoqi Liu


Existing works have advanced Text-to-Image (TTI) diffusion models for video editing in a one-shot learning manner. Despite their low data and computation requirements, these methods can produce results that are inconsistent with both the text prompt and the temporal sequence, limiting their real-world applications. In this paper, we propose to address these issues with a novel EI$^2$ model for \textbf{E}nhancing v\textbf{I}deo \textbf{E}diting cons\textbf{I}stency of TTI-based frameworks. Specifically, our analysis shows that the inconsistency is caused by the modules newly added to TTI models for learning temporal information: they introduce covariate shift in the feature space, which harms the editing capability. We therefore design EI$^2$ with two modules that tackle these drawbacks: a Shift-restricted Temporal Attention Module (STAM) and a Fine-coarse Frame Attention Module (FFAM). First, through theoretical analysis, we show that the covariate shift is closely related to Layer Normalization, so STAM replaces it with an \textit{Instance Centering} layer that preserves the distribution of temporal features. STAM further employs an attention layer with a normalized mapping to transform temporal features while constraining the variance shift. Second, we combine STAM with the novel FFAM, which efficiently leverages fine-coarse spatial information across all frames to further enhance temporal consistency. Extensive experiments demonstrate the superiority of the proposed EI$^2$ model for text-driven video editing.
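
The abstract contrasts the proposed Instance Centering layer with Layer Normalization. Below is a minimal sketch of what such a layer could look like, assuming it simply subtracts the per-token feature mean without LayerNorm's variance rescaling; the module name, shapes, and the absence of learnable affine parameters are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class InstanceCentering(nn.Module):
    """Hypothetical sketch of an instance-centering layer: it removes the
    per-token mean over the feature dimension but, unlike nn.LayerNorm,
    applies no variance rescaling, leaving the spread of the temporal
    features untouched."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, dim) temporal features; the shape is illustrative.
        return x - x.mean(dim=-1, keepdim=True)

# Quick check: the mean is removed while the per-token spread is preserved.
feats = torch.randn(2, 8, 64)
centered = InstanceCentering()(feats)
print(centered.mean(dim=-1).abs().max())                        # ~0
print(torch.allclose(centered.std(dim=-1), feats.std(dim=-1)))  # True
```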


Shadow Removal by High-Quality Shadow Synthesis

Dec 08, 2022
Yunshan Zhong, Mingbao Lin, Lizhou You, Yuxin Zhang, Luoqi Liu, Rongrong Ji


Most shadow removal methods rely on training images paired with laborious and costly shadow region annotations, which has made shadow image synthesis increasingly popular. However, poor performance also stems from these synthesized images, since they are often shadow-inauthentic and detail-impaired. In this paper, we present a novel generation framework, referred to as HQSS, for high-quality pseudo shadow image synthesis. A given image is first decoupled into a shadow region identity and a non-shadow region identity. HQSS employs a shadow feature encoder and a generator to synthesize pseudo images: the encoder extracts the shadow feature of one region identity, which is then paired with another region identity and fed to the generator to synthesize a pseudo image. The pseudo image is expected to carry the shadow characteristics of the input shadow feature as well as the realistic image detail of the input region identity. To fulfill this goal, we design three learning objectives. When the shadow feature and the input region identity come from the same region identity, a self-reconstruction loss guides the generator to reconstruct a pseudo image identical to its input. When they come from different identities, an inter-reconstruction loss and a cycle-reconstruction loss ensure that shadow characteristics and detail information are well retained in the synthesized images. HQSS outperforms state-of-the-art methods on the ISTD, Video Shadow Removal, and SRD datasets. The code is available at https://github.com/zysxmu/HQSS.
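
One plausible reading of the three pairing objectives described above is sketched below, assuming an encoder E that returns a shadow feature, a generator G that takes a (shadow feature, region identity) pair, simple L1 penalties, and equal loss weights; E, G, and the exact forms of the inter- and cycle-reconstruction terms are illustrative guesses, not the released HQSS code.

```python
import torch.nn.functional as F

def hqss_objectives(E, G, region_a, region_b):
    """Sketch of the three HQSS-style objectives (assumptions, see above).
    E(region) -> shadow feature; G(shadow_feature, region) -> pseudo image."""
    # Same-identity pairing: the generator should reproduce its own input.
    loss_self = F.l1_loss(G(E(region_a), region_a), region_a)

    # Cross-identity pairing: transfer region_a's shadow feature onto region_b.
    pseudo_ab = G(E(region_a), region_b)
    # Inter-reconstruction: the pseudo image should keep region_a's shadow
    # characteristics, checked here by re-encoding it.
    loss_inter = F.l1_loss(E(pseudo_ab), E(region_a))
    # Cycle-reconstruction: applying region_b's own shadow feature to the
    # pseudo image should bring back region_b's detail.
    loss_cycle = F.l1_loss(G(E(region_b), pseudo_ab), region_b)

    return loss_self + loss_inter + loss_cycle
```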


Multi-view Human Body Mesh Translator

Oct 04, 2022
Xiangjian Jiang, Xuecheng Nie, Zitian Wang, Luoqi Liu, Si Liu


Existing methods for human mesh recovery mainly focus on single-view frameworks, but they often fail to produce accurate results due to the ill-posed setup. Considering the maturity of multi-view motion capture systems, in this paper we propose to resolve this ill-posedness by leveraging multiple images from different views, thus significantly enhancing the quality of the recovered meshes. In particular, we present a novel \textbf{M}ulti-view human body \textbf{M}esh \textbf{T}ranslator (MMT) model for estimating the human body mesh with the help of a vision transformer. Specifically, MMT takes multi-view images as input and translates them to the target mesh in a single forward pass. MMT fuses features from different views in both the encoding and decoding phases, yielding representations embedded with global information. Additionally, to ensure the tokens focus intensively on human pose and shape, MMT conducts cross-view alignment at the feature level by projecting 3D keypoint positions to each view and enforcing their consistency as a geometric constraint. Comprehensive experiments demonstrate that MMT outperforms existing single- and multi-view models by a large margin on the human mesh recovery task, notably a 28.8\% improvement in MPVE over the current state-of-the-art method on the challenging HUMBI dataset. Qualitative evaluation also verifies the effectiveness of MMT in reconstructing high-quality human meshes. Code will be made available upon acceptance.
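
The cross-view alignment described above projects shared 3D keypoints into every camera view and penalizes disagreement. A minimal sketch of such a consistency term follows, assuming per-view 3x4 projection matrices, per-view 2D keypoint estimates, and a mean-squared penalty; all shapes, names, and the loss choice are assumptions rather than MMT's actual implementation.

```python
import torch

def cross_view_keypoint_consistency(joints_3d, proj_mats, keypoints_2d):
    """joints_3d: (B, J, 3) shared 3D keypoints; proj_mats: (B, V, 3, 4) per-view
    camera projection matrices; keypoints_2d: (B, V, J, 2) per-view 2D estimates."""
    ones = torch.ones_like(joints_3d[..., :1])
    homo = torch.cat([joints_3d, ones], dim=-1)                 # (B, J, 4)
    # Project the same 3D keypoints into every view.
    proj = torch.einsum('bvij,bkj->bvki', proj_mats, homo)      # (B, V, J, 3)
    uv = proj[..., :2] / proj[..., 2:].clamp(min=1e-6)          # perspective divide
    # Penalize disagreement between the projections and the per-view 2D estimates.
    return ((uv - keypoints_2d) ** 2).mean()
```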

* 9 pages 

DropKey

Aug 08, 2022
Bonan Li, Yinhan Hu, Xuecheng Nie, Congying Han, Xiangjian Jiang, Tiande Guo, Luoqi Liu


In this paper, we focus on analyzing and improving the dropout technique for the self-attention layers of Vision Transformers, which is important yet surprisingly ignored by prior works. In particular, we investigate three core questions. First, what should be dropped in self-attention layers? Different from dropping attention weights as in the literature, we propose to move the dropout operation forward, ahead of the attention matrix calculation, and set the Key as the dropout unit, yielding a novel dropout-before-softmax scheme. We theoretically verify that this scheme keeps both the regularization and the probability properties of the attention weights, alleviating overfitting to specific patterns and enhancing the model's ability to capture vital information globally. Second, how should the drop ratio be scheduled across consecutive layers? In contrast to using a constant drop ratio for all layers, we present a new schedule that gradually decreases the drop ratio along the stack of self-attention layers. We experimentally validate that this schedule avoids overfitting low-level features and losing high-level semantics, thus improving the robustness and stability of model training. Third, is a structured dropout operation, as in CNNs, needed? We try a patch-based block version of dropout and find that this trick, useful for CNNs, is not essential for ViTs. Based on the exploration of these three questions, we present the novel DropKey method, which regards the Key as the drop unit and exploits a decreasing schedule for the drop ratio, improving ViTs in a general way. Comprehensive experiments demonstrate the effectiveness of DropKey for various ViT architectures, e.g., T2T and VOLO, as well as for various vision tasks, e.g., image classification, object detection, human-object interaction detection and human body shape recovery. Code will be released upon acceptance.
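
A minimal sketch of the dropout-before-softmax idea described above, with a linearly decreasing drop-ratio schedule across layers; the masking constant, the per-element Bernoulli mask, and the exact schedule are assumptions rather than the authors' released implementation.

```python
import torch

def attention_with_dropkey(q, k, v, drop_ratio, training=True):
    """Scaled dot-product attention with dropout moved before the softmax and
    the Key positions as the drop unit, so the remaining attention weights
    still form a proper probability distribution."""
    attn = (q @ k.transpose(-2, -1)) * q.shape[-1] ** -0.5   # (..., queries, keys)
    if training and drop_ratio > 0:
        # Mask randomly chosen key logits with a large negative value
        # before the softmax (per-element Bernoulli mask).
        mask = torch.bernoulli(torch.full_like(attn, drop_ratio)).bool()
        attn = attn.masked_fill(mask, -1e9)
    return attn.softmax(dim=-1) @ v

def decreasing_drop_ratio(base_ratio, layer_idx, num_layers):
    """Drop ratio that decays linearly with depth (one plausible schedule)."""
    return base_ratio * (1.0 - layer_idx / max(num_layers - 1, 1))

# Usage with toy tensors: the ratio shrinks from base_ratio at layer 0 toward 0.
q = k = v = torch.randn(2, 4, 16, 32)          # (batch, heads, tokens, dim)
out = attention_with_dropkey(q, k, v, decreasing_drop_ratio(0.3, 2, 12))
```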

* 11 Pages 

MorphMLP: A Self-Attention Free, MLP-Like Backbone for Image and Video

Nov 24, 2021
David Junhao Zhang, Kunchang Li, Yunpeng Chen, Yali Wang, Shashwat Chandra, Yu Qiao, Luoqi Liu, Mike Zheng Shou


Self-attention has become an integral component of recent network architectures, e.g., the Transformer, that dominate major image and video benchmarks, because self-attention can flexibly model long-range information. For the same reason, researchers have recently attempted to revive the Multi-Layer Perceptron (MLP) and have proposed a few MLP-Like architectures, showing great potential. However, current MLP-Like architectures are not good at capturing local details and lack a progressive understanding of core details in images and videos. To overcome this issue, we propose a novel MorphMLP architecture that focuses on capturing local details at the low-level layers while gradually shifting toward long-term modeling at the high-level layers. Specifically, we design a Fully-Connected-Like layer, dubbed MorphFC, with two morphable filters that gradually grow their receptive fields along the height and width dimensions. More interestingly, we propose to flexibly adapt our MorphFC layer to the video domain. To the best of our knowledge, we are the first to create an MLP-Like backbone for learning video representations. Finally, we conduct extensive experiments on image classification, semantic segmentation and video classification. Our MorphMLP, a self-attention-free backbone, can be as powerful as, and even outperform, self-attention-based models.
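
A simplified sketch of the kind of fully-connected-like mixing MorphFC performs along one spatial axis: the height dimension is cut into chunks, each chunk is flattened together with the channels and mixed by a shared linear layer, and enlarging the chunk length across stages grows the receptive field. The chunking scheme, the lack of channel grouping, and the shapes are assumptions, not the authors' implementation; the width-direction filter would mirror this along W.

```python
import torch
import torch.nn as nn

class MorphFCHeight(nn.Module):
    """Toy fully-connected-like mixing along the height axis (see assumptions above)."""
    def __init__(self, channels: int, chunk_len: int):
        super().__init__()
        self.chunk_len = chunk_len
        self.fc = nn.Linear(channels * chunk_len, channels * chunk_len)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W), with H divisible by chunk_len for simplicity.
        B, C, H, W = x.shape
        L = self.chunk_len
        x = x.permute(0, 3, 2, 1).reshape(B, W, H // L, L * C)  # chunk the height
        x = self.fc(x)                                           # mix within each chunk
        return x.reshape(B, W, H, C).permute(0, 3, 2, 1)

y = MorphFCHeight(channels=32, chunk_len=4)(torch.randn(2, 32, 16, 16))
print(y.shape)   # torch.Size([2, 32, 16, 16])
```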

* preprint version 

Referring Image Segmentation via Cross-Modal Progressive Comprehension

Oct 01, 2020
Shaofei Huang, Tianrui Hui, Si Liu, Guanbin Li, Yunchao Wei, Jizhong Han, Luoqi Liu, Bo Li


Referring image segmentation aims at segmenting the foreground masks of the entities that match the description given in a natural language expression. Previous approaches tackle this problem with implicit feature interaction and fusion between the visual and linguistic modalities, but usually fail to exploit informative words of the expression to align features from the two modalities well enough to accurately identify the referred entity. In this paper, we propose a Cross-Modal Progressive Comprehension (CMPC) module and a Text-Guided Feature Exchange (TGFE) module to effectively address this challenging task. Concretely, the CMPC module first employs entity and attribute words to perceive all related entities that might be considered by the expression. The relational words are then used to highlight the correct entity and suppress irrelevant ones through multimodal graph reasoning. On top of the CMPC module, we further leverage a simple yet effective TGFE module to integrate the reasoned multimodal features from different levels under the guidance of textual information. In this way, features from multiple levels can communicate with each other and be refined based on the textual context. We conduct extensive experiments on four popular referring segmentation benchmarks and achieve new state-of-the-art performance.
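
A rough sketch of what text-guided exchange across feature levels could look like, assuming each level's feature map is modulated by a gate computed from the sentence embedding and then enriched with the resized gated features of the other levels; the gating form, shapes, and module name are illustrative guesses, not the released CMPC-Refseg code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextGuidedExchange(nn.Module):
    """Toy text-guided feature exchange across levels (assumptions, see above)."""
    def __init__(self, channels, text_dim, num_levels):
        super().__init__()
        self.gates = nn.ModuleList([nn.Linear(text_dim, channels) for _ in range(num_levels)])

    def forward(self, feats, text_emb):
        # feats: list of (B, C, H_i, W_i) maps; text_emb: (B, text_dim).
        gated = [f * torch.sigmoid(g(text_emb))[:, :, None, None]
                 for f, g in zip(feats, self.gates)]
        out = []
        for i, f in enumerate(feats):
            # Each level keeps its own features and receives the text-gated
            # features of every other level, resized to its own resolution.
            extra = sum(F.interpolate(gated[j], size=f.shape[-2:],
                                      mode='bilinear', align_corners=False)
                        for j in range(len(feats)) if j != i)
            out.append(f + extra)
        return out

levels = [torch.randn(1, 64, s, s) for s in (32, 16, 8)]
fused = TextGuidedExchange(64, 300, num_levels=3)(levels, torch.randn(1, 300))
```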

* Accepted by CVPR 2020. Code is available at https://github.com/spyflying/CMPC-Refseg 

Smart Mirror: Intelligent Makeup Recommendation and Synthesis

Sep 22, 2017
Tam V. Nguyen, Luoqi Liu


Beautifying female facial images usually requires professional editing software, which is relatively difficult for ordinary users. In this demo, we introduce a practical system for automatic and personalized facial makeup recommendation and synthesis. First, a model describing the relations among facial features, facial attributes and makeup attributes is learned as the makeup recommendation model and used to suggest the most suitable makeup attributes. The recommended makeup attributes are then seamlessly synthesized onto the input facial image.

* accepted to ACM MM 2017 

Salient Object Detection with Semantic Priors

May 23, 2017
Tam V. Nguyen, Luoqi Liu


Salient object detection has increasingly become a popular topic in cognitive and computational sciences, including computer vision and artificial intelligence research. In this paper, we propose to integrate \textit{semantic priors} into the salient object detection process. Our algorithm consists of three basic steps. First, an explicit saliency map is obtained from semantic segmentation refined by the explicit saliency priors learned from data. Next, an implicit saliency map is computed by a trained model that maps the implicit saliency priors embedded in regional features to saliency values. Finally, the explicit and implicit maps are adaptively fused into a pixel-accurate saliency map that uniformly covers the objects of interest. We evaluate the proposed framework on two challenging datasets, namely ECSSD and HKU-IS. Extensive experimental results demonstrate that our method outperforms other state-of-the-art methods.
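
One simple way to realize the adaptive fusion step described above is a learned per-pixel weighting of the explicit and implicit maps; the convolutional gate below is an illustrative stand-in, not the paper's fusion rule.

```python
import torch
import torch.nn as nn

class AdaptiveSaliencyFusion(nn.Module):
    """Fuse explicit and implicit saliency maps with a learned per-pixel weight
    (a hypothetical sketch of 'adaptive fusion')."""
    def __init__(self):
        super().__init__()
        # Predict a fusion weight in [0, 1] from the two candidate maps.
        self.gate = nn.Sequential(nn.Conv2d(2, 1, kernel_size=3, padding=1),
                                  nn.Sigmoid())

    def forward(self, explicit_map, implicit_map):
        # Both maps: (B, 1, H, W) saliency values in [0, 1].
        w = self.gate(torch.cat([explicit_map, implicit_map], dim=1))
        return w * explicit_map + (1 - w) * implicit_map
```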

* accepted to IJCAI 2017 

Peak-Piloted Deep Network for Facial Expression Recognition

Jan 03, 2017
Xiangyun Zhao, Xiaodan Liang, Luoqi Liu, Teng Li, Yugang Han, Nuno Vasconcelos, Shuicheng Yan


Objective functions for training deep networks on face-related recognition tasks, such as facial expression recognition (FER), usually consider each sample independently. In this work, we present a novel peak-piloted deep network (PPDN) that uses a sample with peak expression (easy sample) to supervise the intermediate feature responses for a sample of non-peak expression (hard sample) of the same type and from the same subject. The expression-evolving process from non-peak to peak expression can thus be implicitly embedded in the network to achieve invariance to expression intensity. A special-purpose back-propagation procedure, peak gradient suppression (PGS), is proposed for network training. It drives the intermediate-layer feature responses of non-peak expression samples towards those of the corresponding peak expression samples, while avoiding the inverse. This prevents the recognition capability for peak expression samples from being degraded by interference from their non-peak counterparts. Extensive comparisons on two popular FER datasets, Oulu-CASIA and CK+, demonstrate the superiority of the PPDN over state-of-the-art FER methods, as well as the advantages of both the network structure and the optimization strategy. Moreover, PPDN is shown to be a general architecture, extensible to other tasks by a proper definition of peak and non-peak samples. This is validated by experiments showing state-of-the-art performance on pose-invariant face recognition using the Multi-PIE dataset.
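
The peak-piloted supervision above can be sketched as a feature-matching loss in which gradients are blocked from flowing into the peak-expression branch, so non-peak features are pulled toward peak features but not the other way around; using detach() for this one-way pull is an assumption about the mechanism, not the authors' exact back-propagation procedure.

```python
import torch
import torch.nn.functional as F

def peak_piloted_loss(nonpeak_feats, peak_feats):
    """Pull intermediate features of a non-peak (hard) sample toward those of
    the peak (easy) sample of the same subject and expression. Detaching the
    peak branch suppresses the inverse gradient, so peak features are not
    degraded by their non-peak counterparts."""
    return F.mse_loss(nonpeak_feats, peak_feats.detach())

# Usage with toy intermediate-layer features:
nonpeak = torch.randn(8, 256, requires_grad=True)
peak = torch.randn(8, 256, requires_grad=True)
peak_piloted_loss(nonpeak, peak).backward()
print(nonpeak.grad is not None, peak.grad is None)   # True True: one-way pull
```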

* Published in ECCV 2016 

Video Scene Parsing with Predictive Feature Learning

Dec 13, 2016
Xiaojie Jin, Xin Li, Huaxin Xiao, Xiaohui Shen, Zhe Lin, Jimei Yang, Yunpeng Chen, Jian Dong, Luoqi Liu, Zequn Jie, Jiashi Feng, Shuicheng Yan


In this work, we address the challenging video scene parsing problem by developing effective representation learning methods given limited parsing annotations. In particular, we contribute two novel methods that constitute a unified parsing framework. (1) \textbf{Predictive feature learning} from nearly unlimited unlabeled video data. Different from existing methods that learn features from single-frame parsing, we learn spatiotemporal discriminative features by enforcing a parsing network to predict future frames and their parsing maps (if available) given only historical frames. In this way, the network can effectively learn to capture video dynamics and temporal context, which are critical clues for video scene parsing, without requiring extra manual annotations. (2) A \textbf{prediction-steering parsing} architecture that effectively adapts the learned spatiotemporal features to scene parsing and provides strong guidance for any off-the-shelf parsing model to achieve better video scene parsing performance. Extensive experiments on two challenging datasets, Cityscapes and CamVid, demonstrate the effectiveness of our methods, showing significant improvement over well-established baselines.
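
A bare-bones sketch of the self-supervised frame-prediction objective described in (1): a network sees only past frames and is trained to regress the next frame, so its features must encode video dynamics without any parsing labels. The toy convolutional predictor, the single-step horizon, and the L1 loss are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FramePredictor(nn.Module):
    """Toy predictor: stack T past RGB frames along channels and regress frame T+1."""
    def __init__(self, num_past_frames=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 * num_past_frames, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1))

    def forward(self, past_frames):
        # past_frames: (B, T, 3, H, W) -> predicted next frame (B, 3, H, W)
        b, t, c, h, w = past_frames.shape
        return self.net(past_frames.reshape(b, t * c, h, w))

# Self-supervised training step on an unlabeled clip of shape (B, T+1, 3, H, W):
clip = torch.rand(2, 5, 3, 64, 64)
model = FramePredictor(num_past_frames=4)
loss = F.l1_loss(model(clip[:, :4]), clip[:, 4])   # no parsing annotation needed
loss.backward()
```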

* 15 pages, 7 figures, 5 tables, currently v2 