National University of Defense Technology, Changsha, China
Abstract:In this paper, we develop a new method, termed SDF-3DGAN, for 3D object generation and 3D-aware image synthesis, which introduces the implicit Signed Distance Function (SDF) as the 3D object representation in the generative setting. We apply SDF for a higher-quality representation of 3D objects in space and design a new SDF neural renderer with higher efficiency and accuracy. To train only on 2D images, we first generate objects represented by SDF from a Gaussian distribution, then render them into 2D images and train them adversarially against the 2D images in the dataset. In the new rendering method, we exploit the mathematical properties of SDF to relieve the computational pressure of previous SDF neural renderers. Specifically, our new SDF neural renderer resolves the sampling ambiguity that arises when the number of sampling points is insufficient, i.e., it uses fewer points to accomplish a higher-quality sampling in the rendering pipeline. Moreover, this rendering pipeline allows us to locate the surface easily, so we apply a normal loss on it to control the smoothness of the generated object surface, which yields much higher generation quality. Quantitative and qualitative experiments conducted on public benchmarks demonstrate favorable performance against state-of-the-art methods on the 3D object generation and 3D-aware image synthesis tasks. Our code will be released at https://github.com/lutao2021/SDF-3DGAN.
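The abstract above notes that the renderer makes the surface easy to locate, so a normal loss can be applied to control surface smoothness. Below is a minimal PyTorch sketch of how such a normal loss could look, assuming an MLP `sdf_net` that maps 3D points to signed distances; the perturbation scale and the loss form are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch: surface normals from an SDF network and a normal-smoothness loss.
# `sdf_net`, `eps`, and the loss form are illustrative assumptions.
import torch
import torch.nn.functional as F

def sdf_normals(sdf_net, points):
    """Normals are the normalized gradient of the SDF with respect to position."""
    points = points.detach().requires_grad_(True)
    sdf = sdf_net(points)
    grad = torch.autograd.grad(sdf.sum(), points, create_graph=True)[0]
    return F.normalize(grad, dim=-1)

def normal_smoothness_loss(sdf_net, surface_points, eps=1e-2):
    """Encourage nearby surface points to share similar normals."""
    n = sdf_normals(sdf_net, surface_points)
    n_jittered = sdf_normals(sdf_net, surface_points + eps * torch.randn_like(surface_points))
    return (n - n_jittered).norm(dim=-1).mean()
```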
Abstract:Correlated Equilibrium (CE) is a well-established solution concept that captures coordination among agents and enjoys good algorithmic properties. In real-world multi-agent systems, in addition to being in an equilibrium, agents' policies are often expected to meet requirements with respect to safety and fairness. Such additional requirements can often be expressed in terms of the state density, which measures the state-visitation frequencies during the course of a game. However, existing CE notions or CE-finding approaches cannot explicitly specify a CE with particular properties concerning state density; they do so implicitly by either modifying reward functions or using value functions as the selection criteria. The resulting CE may thus not fully fulfil the state-density requirements. In this paper, we propose Density-Based Correlated Equilibria (DBCE), a new notion of CE that explicitly takes state density as the selection criterion. Concretely, we instantiate DBCE by specifying different state-density requirements motivated by real-world applications. To compute DBCE, we put forward the Density-Based Correlated Policy Iteration algorithm for the underlying control problem. We perform experiments on various games whose results demonstrate the advantage of our CE-finding approach over existing methods in scenarios with state-density concerns.
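As a concrete illustration of the state density that DBCE uses as its selection criterion, the sketch below estimates a discounted state-visitation density from sampled trajectories of a joint policy in a tabular game; the tabular setting, `num_states`, and `gamma` are assumptions for illustration, and the paper's algorithm optimizes over such densities rather than merely estimating them.

```python
# Minimal sketch: empirical (discounted) state-visitation density from rollouts.
# Tabular states, `num_states`, and `gamma` are illustrative assumptions.
import numpy as np

def empirical_state_density(trajectories, num_states, gamma=0.99):
    """trajectories: list of state-index sequences produced by the joint policy."""
    density = np.zeros(num_states)
    for states in trajectories:
        for t, s in enumerate(states):
            density[s] += gamma ** t
    density /= density.sum()  # normalize to a distribution over states
    return density
```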
Abstract:Domain adaptive detection aims to improve the generalization of detectors on the target domain. To reduce the discrepancy in feature distributions between the two domains, recent approaches achieve domain adaptation through feature alignment at different granularities via adversarial learning. However, they neglect the relationship between multiple granularities and different features during alignment, which degrades detection. To address this, we introduce a unified multi-granularity alignment (MGA)-based detection framework for domain-invariant feature learning. The key is to encode the dependencies across different granularities, including pixel-, instance-, and category-levels, simultaneously to align the two domains. Specifically, based on pixel-level features, we first develop an omni-scale gated fusion (OSGF) module to aggregate discriminative representations of instances with scale-aware convolutions, leading to robust multi-scale detection. Besides, we introduce multi-granularity discriminators to identify which domain, source or target, samples of different granularities come from. Note that MGA not only leverages instance discriminability across different categories but also exploits category consistency between the two domains for detection. Furthermore, we present an adaptive exponential moving average (AEMA) strategy that exploits model assessments for the model update to improve pseudo labels and alleviate the local misalignment problem, boosting detection robustness. Extensive experiments on multiple domain adaptation scenarios validate the superiority of MGA over other approaches on the FCOS and Faster R-CNN detectors. Code will be released at https://github.com/tiankongzhang/MGA.
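The AEMA strategy above updates a model via an exponential moving average whose momentum adapts to model assessments. The PyTorch sketch below shows the general shape of such an update for a teacher that produces pseudo labels; the specific mapping from the assessment score to the momentum is an assumption for illustration.

```python
# Minimal sketch: EMA teacher update with an adaptive momentum.
# The score-to-momentum mapping and its bounds are illustrative assumptions.
import torch

@torch.no_grad()
def aema_update(teacher, student, assessment_score, base_momentum=0.999):
    # A higher assessment score means the current student is trusted more,
    # so the momentum is lowered and the teacher moves faster toward it.
    momentum = base_momentum - 0.01 * float(assessment_score)
    momentum = min(max(momentum, 0.9), base_momentum)
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param, alpha=1.0 - momentum)
```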
Abstract:Automatic security inspection relying on computer vision technology is a challenging task in real-world scenarios due to many factors, such as intra-class variance, class imbalance, and occlusion. Most previous methods rarely touch the cases where prohibited items are deliberately hidden in messy objects, because of the scarcity of large-scale datasets, which hinders their applications. To address this issue and facilitate related research, we present a large-scale dataset, named PIDray, which covers various cases in real-world scenarios for prohibited item detection, especially for deliberately hidden items. In particular, PIDray collects 124,486 X-ray images for 12 categories of prohibited items, and each image is manually annotated with careful inspection, which makes it, to the best of our knowledge, the largest prohibited item detection dataset to date. Meanwhile, we propose a general divide-and-conquer pipeline to develop baseline algorithms on PIDray. Specifically, we adopt a tree-like structure to suppress the influence of the long-tailed issue in the PIDray dataset, where the first coarse-grained node performs binary classification to alleviate the influence of the head categories, while the subsequent fine-grained node is dedicated to the specific tasks of the tail categories. Based on this simple yet effective scheme, we offer strong task-specific baselines across object detection, instance segmentation, and multi-label classification tasks and verify the generalization ability on common datasets (e.g., COCO and PASCAL VOC). Extensive experiments on PIDray demonstrate that the proposed method performs favorably against current state-of-the-art methods, especially for deliberately hidden items. Our benchmark and code will be released at https://github.com/lutao2021/PIDray.
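To make the tree-like, divide-and-conquer scheme concrete, the sketch below shows a coarse-to-fine inference step in which a binary node first decides whether an item is prohibited and a fine-grained node then assigns the specific category; the `coarse_head`/`fine_head` interfaces and the threshold are hypothetical placeholders rather than the released baselines.

```python
# Minimal sketch: coarse-to-fine (tree-like) classification for one detected item.
# `coarse_head` returns a prohibited-vs-background probability; `fine_head`
# returns a dict mapping each of the 12 prohibited categories to a probability.
# Both heads and the threshold are hypothetical placeholders.
def hierarchical_predict(feature, coarse_head, fine_head, threshold=0.5):
    p_prohibited = coarse_head(feature)          # scalar probability of "prohibited"
    if p_prohibited < threshold:
        return "background", p_prohibited
    category_scores = fine_head(feature)         # dict: category name -> probability
    category = max(category_scores, key=category_scores.get)
    return category, p_prohibited * category_scores[category]
```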
Abstract:Image inpainting seeks a semantically consistent way to recover a corrupted image in light of its unmasked content. Previous approaches usually reuse a well-trained GAN as an effective prior to generate realistic patches for missing holes via GAN inversion. Nevertheless, ignoring a hard constraint in these algorithms may yield a gap between GAN inversion and image inpainting. To address this problem, in this paper we devise a novel GAN inversion model for image inpainting, dubbed InvertFill, mainly consisting of an encoder with a pre-modulation module and a GAN generator with an F&W+ latent space. Within the encoder, the pre-modulation network leverages multi-scale structures to encode more discriminative semantics into style vectors. To bridge the gap between GAN inversion and image inpainting, the F&W+ latent space is proposed to eliminate glaring color discrepancies and semantic inconsistencies. To reconstruct faithful and photorealistic images, a simple yet effective Soft-update Mean Latent module is designed to capture more diverse in-domain patterns, which synthesize high-fidelity textures for large corruptions. Comprehensive experiments on four challenging datasets, including Places2, CelebA-HQ, MetFaces, and Scenery, demonstrate that our InvertFill outperforms advanced approaches qualitatively and quantitatively and supports the completion of out-of-domain images well.
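The Soft-update Mean Latent module mentioned above can be pictured as a running, softly updated estimate of the generator's mean latent code. The PyTorch sketch below illustrates one plausible form of such an update, assuming a StyleGAN-like `mapping_net` and an update rate `beta`; both are illustrative assumptions rather than the paper's exact module.

```python
# Minimal sketch: soft update of a running mean latent code.
# `mapping_net`, `batch_size`, and `beta` are illustrative assumptions.
import torch

@torch.no_grad()
def soft_update_mean_latent(mean_latent, mapping_net, batch_size=16, beta=0.001):
    z = torch.randn(batch_size, mean_latent.shape[-1], device=mean_latent.device)
    w = mapping_net(z).mean(dim=0, keepdim=True)   # batch-mean latent code
    return (1.0 - beta) * mean_latent + beta * w   # move the running mean slightly
```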
Abstract:Image inpainting is an ill-posed problem of recovering missing or damaged image content from incomplete images with masks. Previous works usually predict auxiliary structures (e.g., edges, segmentation, and contours) to help fill visually realistic patches in a multi-stage fashion. However, imprecise auxiliary priors may yield biased inpainted results, and some methods are time-consuming because they are implemented with multiple stages of complex neural networks. To address these issues, we develop an end-to-end multi-modality guided transformer network, including one inpainting branch and two auxiliary branches for semantic segmentation and edge textures. Within each transformer block, the proposed multi-scale spatial-aware attention module can learn the multi-modal structural features efficiently via auxiliary denormalization. Different from previous methods relying on direct guidance from biased priors, our method enriches semantically consistent context in an image based on discriminative interplay information from multiple modalities. Comprehensive experiments on several challenging image inpainting datasets show that our method achieves state-of-the-art performance while handling various regular and irregular masks efficiently.
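One common way to realize denormalization conditioned on auxiliary predictions (e.g., the segmentation and edge branches above) is a spatially varying scale-and-shift layer. The PyTorch sketch below shows such a layer under that assumption; the paper's multi-scale spatial-aware attention module is more involved, so this is only an illustrative building block.

```python
# Minimal sketch: a denormalization layer modulated by an auxiliary map.
# Layer sizes and the conditioning scheme are illustrative assumptions.
import torch
import torch.nn as nn

class AuxiliaryDenorm(nn.Module):
    def __init__(self, feat_channels, aux_channels, hidden=64):
        super().__init__()
        self.norm = nn.InstanceNorm2d(feat_channels, affine=False)
        self.shared = nn.Sequential(nn.Conv2d(aux_channels, hidden, 3, padding=1), nn.ReLU())
        self.to_gamma = nn.Conv2d(hidden, feat_channels, 3, padding=1)
        self.to_beta = nn.Conv2d(hidden, feat_channels, 3, padding=1)

    def forward(self, feat, aux):
        # Normalize inpainting features, then apply spatially varying scale/shift
        # predicted from the auxiliary branch output (e.g., segmentation or edges).
        h = self.shared(aux)
        return self.norm(feat) * (1 + self.to_gamma(h)) + self.to_beta(h)
```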
Abstract:Video transition effects are widely used in video editing to connect shots and create cohesive, visually appealing videos. However, it is challenging for non-professionals to choose the best transitions due to a lack of cinematographic knowledge and design skills. In this paper, we present the first work on automatic video transition recommendation (VTR): given a sequence of raw video shots and companion audio, recommend video transitions for each pair of neighboring shots. To solve this task, we collect a large-scale video transition dataset using publicly available video templates from editing software. We then formulate VTR as a multi-modal retrieval problem from vision/audio to video transitions and propose a novel multi-modal matching framework which consists of two parts. First, we learn the embedding of video transitions through a video transition classification task. Then we propose a model to learn the matching correspondence from vision/audio inputs to video transitions. Specifically, the proposed model employs a multi-modal transformer to fuse vision and audio information, as well as to capture the context cues in sequential transition outputs. Through both quantitative and qualitative experiments, we clearly demonstrate the effectiveness of our method. Notably, in a comprehensive user study, our method receives scores comparable with professional editors while improving video editing efficiency by 300×. We hope our work serves to inspire other researchers to work on this new task. The dataset and code are public at https://github.com/acherstyx/AutoTransition.
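Once transition embeddings and the fused vision/audio embedding are learned, recommendation reduces to nearest-neighbor retrieval. The sketch below ranks candidate transitions by cosine similarity to the query embedding; the embedding shapes and function names are illustrative assumptions.

```python
# Minimal sketch: retrieve transitions for one shot pair by cosine similarity.
# Shapes and names are illustrative assumptions.
import torch
import torch.nn.functional as F

def recommend_transitions(query_embed, transition_embeds, top_k=5):
    # query_embed: (d,)  fused vision/audio embedding for a pair of neighboring shots
    # transition_embeds: (num_transitions, d)  learned transition embeddings
    sims = F.cosine_similarity(query_embed.unsqueeze(0), transition_embeds, dim=-1)
    return sims.topk(top_k).indices  # indices of the recommended transitions
```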
Abstract:This paper describes our champion solution for the CVPR 2022 Generic Event Boundary Captioning (GEBC) competition. GEBC requires the captioning model to comprehend instantaneous status changes around a given video boundary, which makes it much more challenging than the conventional video captioning task. In this paper, a Dual-Stream Transformer with improvements in both video content encoding and caption generation is proposed: (1) We utilize three pre-trained models to extract video features at different granularities. Moreover, we exploit the boundary types as hints to help the model generate captions. (2) We design a model, termed the Dual-Stream Transformer, to learn discriminative representations for boundary captioning. (3) To generate content-relevant and human-like captions, we improve description quality with a word-level ensemble strategy. The promising results on the GEBC test split demonstrate the efficacy of our proposed model.
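The word-level ensemble strategy can be pictured as averaging the next-token distributions of several captioning models at each decoding step. The sketch below shows one greedy decoding step under that assumption; the `models` interface returning per-step logits is hypothetical.

```python
# Minimal sketch: one greedy step of a word-level ensemble over captioning models.
# Each model is assumed to return logits of shape (batch, seq_len, vocab).
import torch

@torch.no_grad()
def word_level_ensemble_step(models, video_feats, generated_tokens):
    probs = [m(video_feats, generated_tokens).softmax(dim=-1) for m in models]
    avg_probs = torch.stack(probs, dim=0).mean(dim=0)  # average over models
    return avg_probs[:, -1].argmax(dim=-1)             # next token for each sample
```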
Abstract:Generic Event Boundary Detection (GEBD) aims to detect moments that humans naturally perceive as event boundaries. In this paper, we present the Structured Context Transformer (SC-Transformer) to solve the GEBD task, which can be trained in an end-to-end fashion. Specifically, we use a backbone convolutional neural network (CNN) to extract the features of each video frame. To capture the temporal context of each frame, we design the structured context transformer by re-partitioning the input frame sequence. Note that the overall computational complexity of SC-Transformer is linear in the video length. After that, group similarities are computed to capture the differences between frames. Then, a lightweight fully convolutional network is used to determine the event boundaries based on the group similarity maps. To remedy the ambiguities of boundary annotations, a Gaussian kernel is adopted to preprocess the ground-truth event boundaries, which further boosts accuracy. Extensive experiments conducted on the challenging Kinetics-GEBD and TAPOS datasets demonstrate the effectiveness of the proposed method compared to state-of-the-art methods.
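The Gaussian preprocessing of ground-truth boundaries can be illustrated as turning hard boundary annotations into soft per-frame targets, so frames near an annotated boundary also receive partial supervision. The sketch below shows one such smoothing, with the kernel width `sigma` as an illustrative choice.

```python
# Minimal sketch: smooth binary boundary annotations with a Gaussian kernel.
# The kernel width `sigma` is an illustrative choice.
import numpy as np

def gaussian_smooth_boundaries(boundary_frames, num_frames, sigma=1.0):
    t = np.arange(num_frames, dtype=np.float32)
    target = np.zeros(num_frames, dtype=np.float32)
    for b in boundary_frames:
        target = np.maximum(target, np.exp(-0.5 * ((t - b) / sigma) ** 2))
    return target  # soft ground-truth in [0, 1] for each frame
```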
Abstract:Multi-animal tracking (MAT), a multi-object tracking (MOT) problem, is crucial for animal motion and behavior analysis and has many important applications in fields such as biology, ecology, and animal conservation. Despite its importance, MAT is largely under-explored compared to other MOT problems such as multi-human tracking, due to the scarcity of large-scale benchmarks. To address this problem, we introduce AnimalTrack, a large-scale benchmark for multi-animal tracking in the wild. Specifically, AnimalTrack consists of 58 sequences from a diverse selection of 10 common animal categories. On average, each sequence comprises 33 target objects for tracking. To ensure high quality, every frame in AnimalTrack is manually labeled with careful inspection and refinement. To the best of our knowledge, AnimalTrack is the first benchmark dedicated to multi-animal tracking. In addition, to understand how existing MOT algorithms perform on AnimalTrack and to provide baselines for future comparison, we extensively evaluate 14 state-of-the-art representative trackers. The evaluation results demonstrate that, not surprisingly, most of these trackers degrade due to the differences between pedestrians and animals in various aspects (e.g., pose, motion, and appearance), and more effort is needed to improve multi-animal tracking. We hope that AnimalTrack, together with our evaluation and analysis, will foster further progress in multi-animal tracking. The dataset, evaluation, and our analysis will be made available upon acceptance.