Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yandong Guo

The 3rd Anti-UAV Workshop & Challenge: Methods and Results

May 12, 2023

Jian Zhao, Jianan Li, Lei Jin, Jiaming Chu, Zhihao Zhang, Jun Wang, Jiangqiang Xia, Kai Wang, Yang Liu, Sadaf Gulshad(+13 more)

Figure 1 for The 3rd Anti-UAV Workshop & Challenge: Methods and Results

Figure 2 for The 3rd Anti-UAV Workshop & Challenge: Methods and Results

Figure 3 for The 3rd Anti-UAV Workshop & Challenge: Methods and Results

Abstract:The 3rd Anti-UAV Workshop & Challenge aims to encourage research in developing novel and accurate methods for multi-scale object tracking. The Anti-UAV dataset used for the Anti-UAV Challenge has been publicly released. There are two main differences between this year's competition and the previous two. First, we have expanded the existing dataset, and for the first time, released a training set so that participants can focus on improving their models. Second, we set up two tracks for the first time, i.e., Anti-UAV Tracking and Anti-UAV Detection & Tracking. Around 76 participating teams from the globe competed in the 3rd Anti-UAV Challenge. In this paper, we provide a brief summary of the 3rd Anti-UAV Workshop & Challenge including brief introductions to the top three methods in each track. The submission leaderboard will be reopened for researchers that are interested in the Anti-UAV challenge. The benchmark dataset and other information can be found at: https://anti-uav.github.io/.

* Technical report for 3rd Anti-UAV Workshop and Challenge. arXiv admin note: text overlap with arXiv:2108.09909

Via

Access Paper or Ask Questions

ContrastMotion: Self-supervised Scene Motion Learning for Large-Scale LiDAR Point Clouds

Apr 25, 2023

Xiangze Jia, Hui Zhou, Xinge Zhu, Yandong Guo, Ji Zhang, Yuexin Ma

Figure 1 for ContrastMotion: Self-supervised Scene Motion Learning for Large-Scale LiDAR Point Clouds

Figure 2 for ContrastMotion: Self-supervised Scene Motion Learning for Large-Scale LiDAR Point Clouds

Figure 3 for ContrastMotion: Self-supervised Scene Motion Learning for Large-Scale LiDAR Point Clouds

Figure 4 for ContrastMotion: Self-supervised Scene Motion Learning for Large-Scale LiDAR Point Clouds

Abstract:In this paper, we propose a novel self-supervised motion estimator for LiDAR-based autonomous driving via BEV representation. Different from usually adopted self-supervised strategies for data-level structure consistency, we predict scene motion via feature-level consistency between pillars in consecutive frames, which can eliminate the effect caused by noise points and view-changing point clouds in dynamic scenes. Specifically, we propose \textit{Soft Discriminative Loss} that provides the network with more pseudo-supervised signals to learn discriminative and robust features in a contrastive learning manner. We also propose \textit{Gated Multi-frame Fusion} block that learns valid compensation between point cloud frames automatically to enhance feature extraction. Finally, \textit{pillar association} is proposed to predict pillar correspondence probabilities based on feature distance, and whereby further predicts scene motion. Extensive experiments show the effectiveness and superiority of our \textbf{ContrastMotion} on both scene flow and motion prediction tasks. The code is available soon.

Via

Access Paper or Ask Questions

CABM: Content-Aware Bit Mapping for Single Image Super-Resolution Network with Large Input

Apr 13, 2023

Senmao Tian, Ming Lu, Jiaming Liu, Yandong Guo, Yurong Chen, Shunli Zhang

Figure 1 for CABM: Content-Aware Bit Mapping for Single Image Super-Resolution Network with Large Input

Figure 2 for CABM: Content-Aware Bit Mapping for Single Image Super-Resolution Network with Large Input

Figure 3 for CABM: Content-Aware Bit Mapping for Single Image Super-Resolution Network with Large Input

Figure 4 for CABM: Content-Aware Bit Mapping for Single Image Super-Resolution Network with Large Input

Abstract:With the development of high-definition display devices, the practical scenario of Super-Resolution (SR) usually needs to super-resolve large input like 2K to higher resolution (4K/8K). To reduce the computational and memory cost, current methods first split the large input into local patches and then merge the SR patches into the output. These methods adaptively allocate a subnet for each patch. Quantization is a very important technique for network acceleration and has been used to design the subnets. Current methods train an MLP bit selector to determine the propoer bit for each layer. However, they uniformly sample subnets for training, making simple subnets overfitted and complicated subnets underfitted. Therefore, the trained bit selector fails to determine the optimal bit. Apart from this, the introduced bit selector brings additional cost to each layer of the SR network. In this paper, we propose a novel method named Content-Aware Bit Mapping (CABM), which can remove the bit selector without any performance loss. CABM also learns a bit selector for each layer during training. After training, we analyze the relation between the edge information of an input patch and the bit of each layer. We observe that the edge information can be an effective metric for the selected bit. Therefore, we design a strategy to build an Edge-to-Bit lookup table that maps the edge score of a patch to the bit of each layer during inference. The bit configuration of SR network can be determined by the lookup tables of all layers. Our strategy can find better bit configuration, resulting in more efficient mixed precision networks. We conduct detailed experiments to demonstrate the generalization ability of our method. The code will be released.

* Accepted to CVPR2023

Via

Access Paper or Ask Questions

A Comprehensive Comparison of Projections in Omnidirectional Super-Resolution

Apr 13, 2023

Huicheng Pi, Senmao Tian, Ming Lu, Jiaming Liu, Yandong Guo, Shunli Zhang

Figure 1 for A Comprehensive Comparison of Projections in Omnidirectional Super-Resolution

Figure 2 for A Comprehensive Comparison of Projections in Omnidirectional Super-Resolution

Figure 3 for A Comprehensive Comparison of Projections in Omnidirectional Super-Resolution

Figure 4 for A Comprehensive Comparison of Projections in Omnidirectional Super-Resolution

Abstract:Super-Resolution (SR) has gained increasing research attention over the past few years. With the development of Deep Neural Networks (DNNs), many super-resolution methods based on DNNs have been proposed. Although most of these methods are aimed at ordinary frames, there are few works on super-resolution of omnidirectional frames. In these works, omnidirectional frames are projected from the 3D sphere to a 2D plane by Equi-Rectangular Projection (ERP). Although ERP has been widely used for projection, it has severe projection distortion near poles. Current DNN-based SR methods use 2D convolution modules, which is more suitable for the regular grid. In this paper, we find that different projection methods have great impact on the performance of DNNs. To study this problem, a comprehensive comparison of projections in omnidirectional super-resolution is conducted. We compare the SR results of different projection methods. Experimental results show that Equi-Angular cube map projection (EAC), which has minimal distortion, achieves the best result in terms of WS-PSNR compared with other projections. Code and data will be released.

* Accepted to ICASSP2023

Via

Access Paper or Ask Questions

SGL: Structure Guidance Learning for Camera Localization

Apr 12, 2023

Xudong Zhang, Shuang Gao, Xiaohu Nan, Haikuan Ning, Yuchen Yang, Yishan Ping, Jixiang Wan, Shuzhou Dong, Jijunnan Li, Yandong Guo

Figure 1 for SGL: Structure Guidance Learning for Camera Localization

Figure 2 for SGL: Structure Guidance Learning for Camera Localization

Figure 3 for SGL: Structure Guidance Learning for Camera Localization

Figure 4 for SGL: Structure Guidance Learning for Camera Localization

Abstract:Camera localization is a classical computer vision task that serves various Artificial Intelligence and Robotics applications. With the rapid developments of Deep Neural Networks (DNNs), end-to-end visual localization methods are prosperous in recent years. In this work, we focus on the scene coordinate prediction ones and propose a network architecture named as Structure Guidance Learning (SGL) which utilizes the receptive branch and the structure branch to extract both high-level and low-level features to estimate the 3D coordinates. We design a confidence strategy to refine and filter the predicted 3D observations, which enables us to estimate the camera poses by employing the Perspective-n-Point (PnP) with RANSAC. In the training part, we design the Bundle Adjustment trainer to help the network fit the scenes better. Comparisons with some state-of-the-art (SOTA) methods and sufficient ablation experiments confirm the validity of our proposed architecture.

Via

Access Paper or Ask Questions

CloSET: Modeling Clothed Humans on Continuous Surface with Explicit Template Decomposition

Apr 06, 2023

Hongwen Zhang, Siyou Lin, Ruizhi Shao, Yuxiang Zhang, Zerong Zheng, Han Huang, Yandong Guo, Yebin Liu

Abstract:Creating animatable avatars from static scans requires the modeling of clothing deformations in different poses. Existing learning-based methods typically add pose-dependent deformations upon a minimally-clothed mesh template or a learned implicit template, which have limitations in capturing details or hinder end-to-end learning. In this paper, we revisit point-based solutions and propose to decompose explicit garment-related templates and then add pose-dependent wrinkles to them. In this way, the clothing deformations are disentangled such that the pose-dependent wrinkles can be better learned and applied to unseen poses. Additionally, to tackle the seam artifact issues in recent state-of-the-art point-based methods, we propose to learn point features on a body surface, which establishes a continuous and compact feature space to capture the fine-grained and pose-dependent clothing geometry. To facilitate the research in this field, we also introduce a high-quality scan dataset of humans in real-world clothing. Our approach is validated on two existing datasets and our newly introduced dataset, showing better clothing deformation results in unseen poses. The project page with code and dataset can be found at https://www.liuyebin.com/closet.

* Accepted to CVPR 2023. Project page: https://www.liuyebin.com/closet

Via

Access Paper or Ask Questions

Learning Audio-Visual Source Localization via False Negative Aware Contrastive Learning

Mar 25, 2023

Weixuan Sun, Jiayi Zhang, Jianyuan Wang, Zheyuan Liu, Yiran Zhong, Tianpeng Feng, Yandong Guo, Yanhao Zhang, Nick Barnes

Abstract:Self-supervised audio-visual source localization aims to locate sound-source objects in video frames without extra annotations. Recent methods often approach this goal with the help of contrastive learning, which assumes only the audio and visual contents from the same video are positive samples for each other. However, this assumption would suffer from false negative samples in real-world training. For example, for an audio sample, treating the frames from the same audio class as negative samples may mislead the model and therefore harm the learned representations e.g., the audio of a siren wailing may reasonably correspond to the ambulances in multiple images). Based on this observation, we propose a new learning strategy named False Negative Aware Contrastive (FNAC) to mitigate the problem of misleading the training with such false negative samples. Specifically, we utilize the intra-modal similarities to identify potentially similar samples and construct corresponding adjacency matrices to guide contrastive learning. Further, we propose to strengthen the role of true negative samples by explicitly leveraging the visual features of sound sources to facilitate the differentiation of authentic sounding source regions. FNAC achieves state-of-the-art performances on Flickr-SoundNet, VGG-Sound, and AVSBench, which demonstrates the effectiveness of our method in mitigating the false negative issue. The code is available at \url{https://github.com/OpenNLPLab/FNAC_AVL}.

* CVPR2023

Via

Access Paper or Ask Questions

Box-Level Active Detection

Mar 23, 2023

Mengyao Lyu, Jundong Zhou, Hui Chen, Yijie Huang, Dongdong Yu, Yaqian Li, Yandong Guo, Yuchen Guo, Liuyu Xiang, Guiguang Ding

Abstract:Active learning selects informative samples for annotation within budget, which has proven efficient recently on object detection. However, the widely used active detection benchmarks conduct image-level evaluation, which is unrealistic in human workload estimation and biased towards crowded images. Furthermore, existing methods still perform image-level annotation, but equally scoring all targets within the same image incurs waste of budget and redundant labels. Having revealed above problems and limitations, we introduce a box-level active detection framework that controls a box-based budget per cycle, prioritizes informative targets and avoids redundancy for fair comparison and efficient application. Under the proposed box-level setting, we devise a novel pipeline, namely Complementary Pseudo Active Strategy (ComPAS). It exploits both human annotations and the model intelligence in a complementary fashion: an efficient input-end committee queries labels for informative objects only; meantime well-learned targets are identified by the model and compensated with pseudo-labels. ComPAS consistently outperforms 10 competitors under 4 settings in a unified codebase. With supervision from labeled data only, it achieves 100% supervised performance of VOC0712 with merely 19% box annotations. On the COCO dataset, it yields up to 4.3% mAP improvement over the second-best method. ComPAS also supports training with the unlabeled pool, where it surpasses 90% COCO supervised performance with 85% label reduction. Our source code is publicly available at https://github.com/lyumengyao/blad.

* CVPR 2023 highlight

Via

Access Paper or Ask Questions

PiMAE: Point Cloud and Image Interactive Masked Autoencoders for 3D Object Detection

Mar 14, 2023

Anthony Chen, Kevin Zhang, Renrui Zhang, Zihan Wang, Yuheng Lu, Yandong Guo, Shanghang Zhang

Figure 1 for PiMAE: Point Cloud and Image Interactive Masked Autoencoders for 3D Object Detection

Figure 2 for PiMAE: Point Cloud and Image Interactive Masked Autoencoders for 3D Object Detection

Figure 3 for PiMAE: Point Cloud and Image Interactive Masked Autoencoders for 3D Object Detection

Figure 4 for PiMAE: Point Cloud and Image Interactive Masked Autoencoders for 3D Object Detection

Abstract:Masked Autoencoders learn strong visual representations and achieve state-of-the-art results in several independent modalities, yet very few works have addressed their capabilities in multi-modality settings. In this work, we focus on point cloud and RGB image data, two modalities that are often presented together in the real world, and explore their meaningful interactions. To improve upon the cross-modal synergy in existing works, we propose PiMAE, a self-supervised pre-training framework that promotes 3D and 2D interaction through three aspects. Specifically, we first notice the importance of masking strategies between the two sources and utilize a projection module to complementarily align the mask and visible tokens of the two modalities. Then, we utilize a well-crafted two-branch MAE pipeline with a novel shared decoder to promote cross-modality interaction in the mask tokens. Finally, we design a unique cross-modal reconstruction module to enhance representation learning for both modalities. Through extensive experiments performed on large-scale RGB-D scene understanding benchmarks (SUN RGB-D and ScannetV2), we discover it is nontrivial to interactively learn point-image features, where we greatly improve multiple 3D detectors, 2D detectors, and few-shot classifiers by 2.9%, 6.7%, and 2.4%, respectively. Code is available at https://github.com/BLVLab/PiMAE.

* Accepted by CVPR2023. Code is available at https://github.com/BLVLab/PiMAE

Via

Access Paper or Ask Questions

Tag2Text: Guiding Vision-Language Model via Image Tagging

Mar 10, 2023

Xinyu Huang, Youcai Zhang, Jinyu Ma, Weiwei Tian, Rui Feng, Yuejie Zhang, Yaqian Li, Yandong Guo, Lei Zhang

Figure 1 for Tag2Text: Guiding Vision-Language Model via Image Tagging

Figure 2 for Tag2Text: Guiding Vision-Language Model via Image Tagging

Figure 3 for Tag2Text: Guiding Vision-Language Model via Image Tagging

Figure 4 for Tag2Text: Guiding Vision-Language Model via Image Tagging

Abstract:This paper presents Tag2Text, a vision language pre-training (VLP) framework, which introduces image tagging into vision-language models to guide the learning of visual-linguistic features. In contrast to prior works which utilize object tags either manually labeled or automatically detected with a limited detector, our approach utilizes tags parsed from its paired text to learn an image tagger and meanwhile provides guidance to vision-language models. Given that, Tag2Text can utilize large-scale annotation-free image tags in accordance with image-text pairs, and provides more diverse tag categories beyond objects. As a result, Tag2Text achieves a superior image tag recognition ability by exploiting fine-grained text information. Moreover, by leveraging tagging guidance, Tag2Text effectively enhances the performance of vision-language models on both generation-based and alignment-based tasks. Across a wide range of downstream benchmarks, Tag2Text achieves state-of-the-art or competitive results with similar model sizes and data scales, demonstrating the efficacy of the proposed tagging guidance.

Via

Access Paper or Ask Questions