Li Niu

End-to-End Video Object Detection with Spatial-Temporal Transformers

May 23, 2021

Lu He, Qianyu Zhou, Xiangtai Li, Li Niu, Guangliang Cheng, Xiao Li, Wenxuan Liu, Yunhai Tong, Lizhuang Ma, Liqing Zhang

Abstract: Recently, DETR and Deformable DETR have been proposed to eliminate the need for many hand-designed components in object detection while matching the performance of previous complex hand-crafted detectors. However, their performance on Video Object Detection (VOD) has not been well explored. In this paper, we present TransVOD, an end-to-end video object detection model based on a spatial-temporal Transformer architecture. The goal of this paper is to streamline the pipeline of VOD, effectively removing the need for many hand-crafted components for feature aggregation, e.g., optical flow, recurrent neural networks, and relation networks. Moreover, benefiting from the object query design in DETR, our method does not need complicated post-processing methods such as Seq-NMS or Tubelet rescoring, which keeps the pipeline simple and clean. In particular, we present a temporal Transformer to aggregate both the spatial object queries and the feature memories of each frame. Our temporal Transformer consists of three components: a Temporal Deformable Transformer Encoder (TDTE) to encode the multi-frame spatial details, a Temporal Query Encoder (TQE) to fuse object queries, and a Temporal Deformable Transformer Decoder to obtain the current-frame detection results. These designs boost the strong baseline Deformable DETR by a significant margin (3%-4% mAP) on the ImageNet VID dataset. TransVOD yields comparable performance on the ImageNet VID benchmark. We hope our TransVOD can provide a new perspective for video object detection. Code will be made publicly available at https://github.com/SJTU-LuHe/TransVOD.

Shadow Generation for Composite Image in Real-world Scenes

Apr 21, 2021

Yan Hong, Li Niu, Jianfu Zhang, Liqing Zhang

Abstract: Image composition aims to insert a foreground object into a background image. Most previous image composition methods focus on adjusting the foreground to make it compatible with the background while ignoring the shadow that the foreground casts on the background. In this work, we focus on generating a plausible shadow for the foreground object in the composite image. First, we contribute a real-world shadow generation dataset, DESOBA, by generating synthetic composite images based on paired real images and deshadowed images. Then, we propose a novel shadow generation network, SGRNet, which consists of a shadow mask prediction stage and a shadow filling stage. In the shadow mask prediction stage, foreground and background information interact thoroughly to generate the foreground shadow mask. In the shadow filling stage, shadow parameters are predicted to fill the shadow area. Extensive experiments on our DESOBA dataset and real composite images demonstrate the effectiveness of our proposed method.

Inharmonious Region Localization

Apr 19, 2021

Jing Liang, Li Niu, Liqing Zhang

Abstract: Advances in image editing techniques allow users to create artistic works, but the manipulated regions may be incompatible with the background. Localizing the inharmonious region is an appealing yet challenging task. Realizing that this task requires effective aggregation of multi-scale contextual information and suppression of redundant information, we design a novel Bi-directional Feature Integration (BFI) block and a Global-context Guided Decoder (GGD) block to fuse multi-scale features in the encoder and decoder, respectively. We also employ a Mask-guided Dual Attention (MDA) block between the encoder and decoder to suppress the redundant information. Experiments on the image harmonization dataset demonstrate that our method achieves competitive performance for inharmonious region localization. The source code is available at https://github.com/bcmi/DIRL.

* 12 pages, 7 figures

Image Composition Assessment with Saliency-augmented Multi-pattern Pooling

Apr 07, 2021

Bo Zhang, Li Niu, Liqing Zhang

Abstract: Image composition assessment, which aims to assess the overall composition quality of a given image, is crucial in aesthetic assessment. However, to the best of our knowledge, neither a dataset nor a method has been specifically proposed for this task. In this paper, we contribute the first composition assessment dataset, CADB, with composition scores for each image provided by multiple professional raters. In addition, we propose a composition assessment network, SAMP-Net, with a novel Saliency-Augmented Multi-pattern Pooling (SAMP) module, which analyses visual layout from the perspectives of multiple composition patterns. We also leverage composition-relevant attributes to further boost the performance, and extend the Earth Mover's Distance (EMD) loss to a weighted EMD loss to eliminate the content bias. The experimental results show that our SAMP-Net performs more favorably than previous aesthetic assessment approaches and offers constructive composition suggestions.
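
For readers unfamiliar with the EMD loss mentioned in the abstract, the following is a minimal sketch of the unweighted form commonly used for score distributions in aesthetic assessment, where the loss compares cumulative distributions over ordered score bins. The function name `emd_loss`, the exponent `r`, and the example histograms are illustrative assumptions; the weighted variant proposed in the paper is not specified in the abstract and is not reproduced here.

```python
from itertools import accumulate


def emd_loss(p, q, r=2):
    """Unweighted EMD between two score histograms over the same ordered bins.

    p, q: sequences of probabilities (each assumed to sum to 1).
    r: exponent of the norm; r=2 is a common choice (an assumption here).
    """
    if len(p) != len(q) or not p:
        raise ValueError("p and q must be non-empty and of equal length")
    # Cumulative distribution functions over the ordered score bins.
    cdf_p = list(accumulate(p))
    cdf_q = list(accumulate(q))
    # Mean r-th power of the CDF gaps, followed by the r-th root.
    mean_gap = sum(abs(a - b) ** r for a, b in zip(cdf_p, cdf_q)) / len(p)
    return mean_gap ** (1.0 / r)


# Hypothetical 5-bin rating distributions for one image.
predicted = [0.1, 0.2, 0.4, 0.2, 0.1]
annotated = [0.0, 0.1, 0.5, 0.3, 0.1]
print(emd_loss(predicted, annotated))
```

Because the loss works on cumulative distributions, placing probability mass in a bin far from the annotated score is penalised more than placing it in an adjacent bin, which suits ordered rating scales.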

Deep Image Harmonization by Bridging the Reality Gap

Mar 31, 2021

Wenyan Cong, Junyan Cao, Li Niu, Jianfu Zhang, Xuesong Gao, Zhiwei Tang, Liqing Zhang

Abstract: Image harmonization has been significantly advanced by large-scale harmonization datasets. However, the current way of building such datasets is still labor-intensive, which limits their extensibility. To address this problem, we propose to construct a large-scale rendered harmonization dataset, RHHarmony, with less human effort to augment the existing real-world dataset. To leverage both real-world images and rendered images, we propose a cross-domain harmonization network, CharmNet, to bridge the gap between the two domains. Moreover, we employ well-designed style classifiers and losses to facilitate cross-domain knowledge transfer. Extensive experiments demonstrate the potential of using rendered images for image harmonization and the effectiveness of our proposed network. Our dataset and code are available at https://github.com/bcmi/Rendered_Image_Harmonization_Datasets.

* 17 pages with supplementary

Disentangled Information Bottleneck

Dec 22, 2020

Ziqi Pan, Li Niu, Jianfu Zhang, Liqing Zhang

Abstract: The information bottleneck (IB) method is a technique for extracting information that is relevant for predicting the target random variable from the source random variable. It is typically implemented by optimizing the IB Lagrangian, which balances the compression and prediction terms. However, the IB Lagrangian is hard to optimize, and multiple trials are required to tune the value of the Lagrangian multiplier. Moreover, we show that the prediction performance strictly decreases as the compression gets stronger when optimizing the IB Lagrangian. In this paper, we implement the IB method from the perspective of supervised disentangling. Specifically, we introduce the Disentangled Information Bottleneck (DisenIB), which is consistent on compressing the source maximally without loss of target prediction performance (maximum compression). Theoretical and experimental results demonstrate that our method is consistent on maximum compression, and performs well in terms of generalization, robustness to adversarial attack, out-of-distribution detection, and supervised disentangling.
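
For reference, the IB Lagrangian mentioned in the abstract is commonly written in the following form. The notation is the conventional one from the IB literature and is an assumption here, not taken from the abstract: $X$ is the source, $Y$ the target, $T$ the learned representation, $I(\cdot\,;\cdot)$ mutual information, and $\beta$ the Lagrangian multiplier.

```latex
\min_{p(t \mid x)} \; \mathcal{L}_{\mathrm{IB}} = I(X;T) - \beta \, I(T;Y), \qquad \beta > 0
```

The first term rewards compression of the source and the second rewards prediction of the target, so $\beta$ sets the trade-off that the abstract describes as requiring multiple tuning trials.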

* Revised mathematical proof

From Pixel to Patch: Synthesize Context-aware Features for Zero-shot Semantic Segmentation

Sep 29, 2020

Zhangxuan Gu, Siyuan Zhou, Li Niu, Zihan Zhao, Liqing Zhang

Abstract: Zero-shot learning has been actively studied for the image classification task to relieve the burden of annotating image labels. Interestingly, the semantic segmentation task requires more labor-intensive pixel-wise annotation, but zero-shot semantic segmentation has attracted only limited research interest. Thus, we focus on zero-shot semantic segmentation, which aims to segment unseen objects with only category-level semantic representations provided for unseen categories. In this paper, we propose a novel Context-aware feature Generation Network (CaGNet), which can synthesize context-aware pixel-wise visual features for unseen categories based on category-level semantic representations and pixel-wise contextual information. The synthesized features are used to finetune the classifier to enable segmenting unseen objects. Furthermore, we extend pixel-wise feature generation and finetuning to patch-wise feature generation and finetuning, which additionally considers inter-pixel relationships. Experimental results on Pascal-VOC, Pascal-Context, and COCO-stuff show that our method significantly outperforms the existing zero-shot semantic segmentation methods. Code is available at https://github.com/bcmi/CaGNetv2-Zero-Shot-Semantic-Segmentation.

* submitted to the TIP

Weak-shot Fine-grained Classification via Similarity Transfer

Sep 19, 2020

Junjie Chen, Li Niu, Liu Liu, Liqing Zhang

Abstract: Recognizing fine-grained categories remains a challenging task due to the subtle distinctions among different subordinate categories, which creates a need for abundant annotated samples. To alleviate the data-hungry problem, we consider the problem of learning novel categories from web data with the support of a clean set of base categories, which is referred to as weak-shot learning. Under this setting, we propose to transfer pairwise semantic similarity from base categories to novel categories, because this similarity is highly transferable and beneficial for learning from web data. Specifically, we first train a similarity net on clean data, and then employ two simple yet effective strategies to leverage the transferred similarity to denoise web training data. In addition, we apply an adversarial loss to the similarity net to enhance the transferability of similarity. Comprehensive experiments on three fine-grained datasets demonstrate that a clean set can dramatically facilitate webly supervised learning and that similarity transfer is effective under this setting.

BargainNet: Background-Guided Domain Translation for Image Harmonization

Sep 19, 2020

Wenyan Cong, Li Niu, Jianfu Zhang, Jing Liang, Liqing Zhang

Abstract: Image composition is a fundamental operation in the image editing field. However, an inharmonious foreground and background degrade the quality of the composite image. Image harmonization, which adjusts the foreground to improve the consistency, is an essential yet challenging task. Previous deep learning based methods mainly focus on directly learning the mapping from composite image to real image, while ignoring the crucial guidance role that the background plays. In this work, with the assumption that the foreground needs to be translated to the same domain as the background, we formulate the image harmonization task as background-guided domain translation. Therefore, we propose an image harmonization network with a novel domain code extractor and well-tailored triplet losses, which can capture the background domain information to guide the foreground harmonization. Extensive experiments on the existing image harmonization benchmark demonstrate the effectiveness of our proposed method.
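
As background for the triplet losses mentioned in the abstract, here is a minimal sketch of a generic margin-based triplet loss on embedding vectors. It is not the paper's loss: the abstract does not say which domain codes serve as anchor, positive, and negative, so the function names, the Euclidean distance, the margin value, and the example vectors are all illustrative assumptions.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Generic triplet loss: pull the positive toward the anchor and push
    the negative at least `margin` farther away than the positive."""
    gap = euclidean(anchor, positive) - euclidean(anchor, negative) + margin
    return max(0.0, gap)


# Hypothetical 2-D "domain codes", for illustration only.
anchor_code = [0.0, 0.0]
near_code = [0.1, 0.0]
far_code = [3.0, 4.0]
print(triplet_loss(anchor_code, near_code, far_code))
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, so only triplets that violate the margin contribute to training.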

DeltaGAN: Towards Diverse Few-shot Image Generation with Sample-Specific Delta

Sep 18, 2020

Yan Hong, Li Niu, Jianfu Zhang, Jing Liang, Liqing Zhang

Abstract: Learning to generate new images for a novel category based on only a few images, known as few-shot image generation, has attracted increasing research interest. Several state-of-the-art works have yielded impressive results, but the diversity is still limited. In this work, we propose a novel Delta Generative Adversarial Network (DeltaGAN), which consists of a reconstruction subnetwork and a generation subnetwork. The reconstruction subnetwork captures the intra-category transformation, i.e., the "delta", between same-category pairs. The generation subnetwork generates a sample-specific "delta" for an input image, which is combined with this input image to generate a new image within the same category. In addition, an adversarial delta matching loss is designed to link the above two subnetworks together. Extensive experiments on five few-shot image datasets demonstrate the effectiveness of our proposed method.
