Yanhao Zhang

View Consistent Purification for Accurate Cross-View Localization

Aug 16, 2023
Shan Wang, Yanhao Zhang, Akhil Perincherry, Ankit Vora, Hongdong Li

This paper proposes a fine-grained self-localization method for outdoor robotics that utilizes a flexible number of onboard cameras and readily accessible satellite images. The proposed method addresses limitations in existing cross-view localization methods that struggle to handle noise sources such as moving objects and seasonal variations. It is the first sparse visual-only method that enhances perception in dynamic environments by detecting view-consistent key points and their corresponding deep features from ground and satellite views, while removing off-the-ground objects and establishing a homography transformation between the two views. Moreover, the proposed method incorporates a spatial embedding approach that leverages camera intrinsic and extrinsic information to reduce the ambiguity of purely visual matching, leading to improved feature matching and overall pose estimation accuracy. The method exhibits strong generalization, is robust to environmental changes, and requires only geo-poses as ground truth. Extensive experiments on the KITTI and Ford Multi-AV Seasonal datasets demonstrate that our proposed method outperforms existing state-of-the-art methods, achieving median localization errors below $0.5$ meters along the lateral and longitudinal directions and a median orientation error below $2$ degrees.

* Accepted for ICCV 2023 
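As a rough illustration of the homography step mentioned in the abstract, the sketch below shows how the ground plane induces a closed-form mapping between satellite-map pixels and ground-camera pixels. All numbers, the camera pose, and the map scale are illustrative assumptions; the paper's learned keypoint detection and view-consistent purification are not reproduced here.

```python
import numpy as np

# Pinhole ground camera. World frame: z up, ground plane z = 0.
# Intrinsics are KITTI-like but purely illustrative.
K = np.array([[718.9, 0.0, 607.2],
              [0.0, 718.9, 185.2],
              [0.0, 0.0, 1.0]])
# Camera 1.65 m above the ground, looking along world +y.
R = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0, 0.0]])          # world-to-camera rotation
t = np.array([0.0, 1.65, 0.0])           # world-to-camera translation

# A ground point [x, y, 0, 1] projects as K @ [r1 r2 t] @ [x, y, 1]^T,
# so the ground plane induces a 3x3 homography into the camera image.
H_cam = K @ np.column_stack([R[:, 0], R[:, 1], t])

# Satellite image: orthographic top-down map with a metres-to-pixels affine.
m_per_px = 0.2
A_sat = np.array([[1.0 / m_per_px, 0.0, 256.0],
                  [0.0, 1.0 / m_per_px, 256.0],
                  [0.0, 0.0, 1.0]])       # world (x, y) -> satellite pixel

# Homography taking satellite pixels to ground-view pixels.
H_sat_to_cam = H_cam @ np.linalg.inv(A_sat)

def warp(H, uv):
    """Apply a 3x3 homography to one 2D point (u, v)."""
    p = H @ np.array([uv[0], uv[1], 1.0])
    return p[:2] / p[2]

print(warp(H_sat_to_cam, (270.0, 300.0)))  # a satellite pixel in the camera image
```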

All-pairs Consistency Learning for Weakly Supervised Semantic Segmentation

Aug 08, 2023
Weixuan Sun, Yanhao Zhang, Zhen Qin, Zheyuan Liu, Lin Cheng, Fanyi Wang, Yiran Zhong, Nick Barnes


In this work, we propose a new transformer-based regularization to better localize objects for weakly supervised semantic segmentation (WSSS). In image-level WSSS, the Class Activation Map (CAM) is adopted to generate object localizations as pseudo segmentation labels. To address the partial-activation issue of CAMs, consistency regularization is employed to maintain activation-intensity invariance across various image augmentations. However, such methods ignore the pair-wise relations among regions within each CAM, which capture context and should also be invariant across image views. To this end, we propose a new all-pairs consistency regularization (ACR). Given a pair of augmented views, our approach regularizes their activation intensities while also ensuring that the affinities across regions within each view remain consistent. We adopt vision transformers, whose self-attention mechanism naturally embeds pair-wise affinity; this enables us to simply regularize the distance between the attention matrices of augmented image pairs. Additionally, we introduce a novel class-wise localization method that leverages the gradients of the class token. Our method can be seamlessly integrated into existing transformer-based WSSS methods without modifying the architectures. We evaluate our method on the PASCAL VOC and MS COCO datasets. It produces noticeably better class localization maps (67.3% mIoU on the PASCAL VOC train set), resulting in superior WSSS performance.

* ICCV 2023 workshop 
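A minimal sketch of the consistency objective described above, assuming purely photometric augmentations so the two views need no spatial re-alignment; the weighting and distance choices are illustrative, not the paper's exact losses:

```python
import torch
import torch.nn.functional as F

def acr_style_loss(cam1, cam2, attn1, attn2, w_aff=0.1):
    """Consistency sketch: match CAM intensities and the pair-wise
    affinities (self-attention matrices) of two augmented views.

    cam1, cam2:   (B, K, H, W) class activation maps of the two views
    attn1, attn2: (B, heads, N, N) self-attention matrices
    w_aff is an illustrative weight, not a value from the paper.
    """
    loss_cam = F.mse_loss(cam1, cam2)    # activation-intensity consistency
    loss_aff = F.mse_loss(attn1, attn2)  # all-pairs affinity consistency
    return loss_cam + w_aff * loss_aff
```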

TALL: Thumbnail Layout for Deepfake Video Detection

Jul 14, 2023
Yuting Xu, Jian Liang, Gengyun Jia, Ziming Yang, Yanhao Zhang, Ran He


The growing threat of deepfakes to society and cybersecurity has raised enormous public concern, and increasing effort has been devoted to the critical topic of deepfake video detection. Existing video methods achieve good performance but are computationally intensive. This paper introduces a simple yet effective strategy named Thumbnail Layout (TALL), which transforms a video clip into a predefined layout that preserves both spatial and temporal dependencies. Specifically, consecutive frames are each masked at a fixed position to improve generalization, then resized into sub-images and rearranged into a predefined layout as the thumbnail. TALL is model-agnostic and extremely simple, requiring only a few lines of code to be modified. Inspired by the success of vision transformers, we incorporate TALL into the Swin Transformer, forming an efficient and effective method, TALL-Swin. Extensive intra-dataset and cross-dataset experiments validate the effectiveness and superiority of TALL and the state-of-the-art TALL-Swin. TALL-Swin achieves 90.79$\%$ AUC on the challenging cross-dataset task FaceForensics++ $\to$ Celeb-DF. The code is available at https://github.com/rainy-xu/TALL4Deepfake.

* Accepted by ICCV 2023 
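The thumbnail transform is simple enough to sketch directly. The following is a hedged reconstruction from the abstract alone; the mask position and size, grid shape, and sub-image size are illustrative parameters, not the paper's settings:

```python
import torch
import torch.nn.functional as F

def thumbnail_layout(clip, grid=(2, 2), size=112, mask_frac=0.25):
    """Hedged TALL-style transform: mask a fixed patch in every frame,
    shrink the frames to sub-images, and tile them into one thumbnail.

    clip: (T, C, H, W) consecutive frames with T == grid[0] * grid[1].
    """
    T, C, H, W = clip.shape
    clip = clip.clone()
    mh, mw = int(H * mask_frac), int(W * mask_frac)
    clip[:, :, :mh, :mw] = 0.0            # fixed-position mask in every frame
    subs = F.interpolate(clip, size=(size, size),
                         mode='bilinear', align_corners=False)
    rows = [torch.cat(list(subs[r * grid[1]:(r + 1) * grid[1]]), dim=-1)
            for r in range(grid[0])]
    return torch.cat(rows, dim=-2)        # (C, grid_h*size, grid_w*size)

thumb = thumbnail_layout(torch.rand(4, 3, 224, 224))  # -> (3, 224, 224)
```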

An Alternative to WSSS? An Empirical Study of the Segment Anything Model (SAM) on Weakly-Supervised Semantic Segmentation Problems

May 02, 2023
Weixuan Sun, Zheyuan Liu, Yanhao Zhang, Yiran Zhong, Nick Barnes


The Segment Anything Model (SAM) has demonstrated exceptional performance and versatility, making it a promising tool for various related tasks. In this report, we explore the application of SAM to weakly-supervised semantic segmentation (WSSS). In particular, we adapt SAM as the pseudo-label generation pipeline given only image-level class labels. While we observe impressive results in most cases, we also identify certain limitations. Our study includes performance evaluations on PASCAL VOC and MS-COCO, where we achieve remarkable improvements over the latest state-of-the-art methods on both datasets. We anticipate that this report will encourage further exploration of SAM in WSSS, as well as wider real-world applications.

* Technical report 
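The abstract does not spell out the pipeline, but one plausible way (not necessarily the report's exact recipe) to turn SAM outputs into WSSS pseudo-labels given only image-level classes is to label each SAM mask with the class whose CAM activates most strongly inside it. The sketch below assumes precomputed SAM masks (e.g., from SamAutomaticMaskGenerator) and normalized CAMs; the threshold and the big-to-small overwrite order are illustrative:

```python
import numpy as np

def sam_pseudo_labels(masks, cams, score_thresh=0.4, ignore_index=255):
    """Assign each SAM mask the class whose CAM activates most strongly
    inside it; pixels covered by no confident mask stay ignored.

    masks: list of (H, W) boolean arrays (precomputed SAM masks)
    cams:  (num_classes, H, W) CAMs of the image-level classes, in [0, 1]
    """
    h, w = cams.shape[1:]
    label = np.full((h, w), ignore_index, dtype=np.uint8)
    for m in sorted(masks, key=lambda m: m.sum(), reverse=True):
        scores = cams[:, m].mean(axis=1)      # mean activation per class
        cls = int(scores.argmax())
        if scores[cls] >= score_thresh:
            label[m] = cls                    # smaller masks overwrite later
    return label
```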

Learning Audio-Visual Source Localization via False Negative Aware Contrastive Learning

Mar 25, 2023
Weixuan Sun, Jiayi Zhang, Jianyuan Wang, Zheyuan Liu, Yiran Zhong, Tianpeng Feng, Yandong Guo, Yanhao Zhang, Nick Barnes


Self-supervised audio-visual source localization aims to locate sound-source objects in video frames without extra annotations. Recent methods often approach this goal with the help of contrastive learning, which assumes that only the audio and visual contents from the same video are positive samples for each other. However, this assumption suffers from false negative samples in real-world training. For example, for an audio sample, treating frames from the same audio class as negative samples may mislead the model and therefore harm the learned representations (e.g., the audio of a wailing siren may reasonably correspond to the ambulances in multiple images). Based on this observation, we propose a new learning strategy named False Negative Aware Contrastive (FNAC) learning to mitigate the problem of misleading the training with such false negative samples. Specifically, we utilize intra-modal similarities to identify potentially similar samples and construct corresponding adjacency matrices to guide contrastive learning. Further, we propose to strengthen the role of true negative samples by explicitly leveraging the visual features of sound sources to facilitate the differentiation of authentic sounding-source regions. FNAC achieves state-of-the-art performance on Flickr-SoundNet, VGG-Sound, and AVSBench, which demonstrates the effectiveness of our method in mitigating the false-negative issue. The code is available at \url{https://github.com/OpenNLPLab/FNAC_AVL}.

* CVPR 2023 
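A hedged sketch of the core idea: intra-modal (audio-audio) similarities flag likely false negatives, which are down-weighted in an InfoNCE-style denominator. The weighting scheme below is illustrative, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def fnac_style_loss(a, v, tau=0.07):
    """False-negative-aware contrastive sketch.

    a, v: (B, D) audio / visual embeddings of one batch (row i is a pair).
    """
    a, v = F.normalize(a, dim=-1), F.normalize(v, dim=-1)
    logits = a @ v.t() / tau                   # cross-modal similarities
    with torch.no_grad():                      # adjacency only guides weights
        sim_aa = (a @ a.t()).clamp(min=0)      # intra-modal adjacency
        weights = 1.0 - sim_aa                 # similar pair -> weak negative
        weights.fill_diagonal_(1.0)            # positives keep full weight
    exp = torch.exp(logits) * weights
    return -(logits.diagonal() - exp.sum(dim=1).log()).mean()
```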

GAM: Gradient Attention Module of Optimization for Point Clouds Analysis

Mar 21, 2023
Haotian Hu, Fanyi Wang, Jingwen Su, Hongtao Zhou, Yaonong Wang, Laifeng Hu, Yanhao Zhang, Zhiwang Zhang


In point cloud analysis tasks, existing local feature aggregation descriptors (LFAD) are unable to fully utilize the information in the neighborhood of central points. Previous methods rely solely on Euclidean distance to constrain the local aggregation process, which can be easily affected by abnormal points and cannot adequately fit the original geometry of the point cloud. We believe that fine-grained geometric information (FGGI) is significant for the aggregation of local features. Therefore, we propose a gradient-based local attention module, termed the Gradient Attention Module (GAM), to address the aforementioned problem. Our proposed GAM simplifies the process of extracting gradient information in the neighborhood, using the zenith-angle matrix and azimuth-angle matrix as an explicit representation, which accelerates the module by 35x. Comprehensive experiments were conducted on five benchmark datasets to demonstrate the effectiveness and generalization capability of the proposed GAM for 3D point cloud analysis. In particular, on the S3DIS dataset, GAM achieves the best performance among current point-based models, with mIoU/OA/mAcc of 74.4%/90.6%/83.2%, respectively.

* In AAAI, 2023 
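The angular representation is straightforward to sketch: for each center point, the offsets to its neighbors are converted into zenith- and azimuth-angle matrices. Tensor shapes are illustrative assumptions:

```python
import torch

def zenith_azimuth(points, neighbors):
    """Explicit angular representation of local geometry: zenith and
    azimuth angles of the offsets from each centre point to its neighbours.

    points:    (B, N, 3) centre points
    neighbors: (B, N, K, 3) their K nearest neighbours
    """
    d = neighbors - points.unsqueeze(2)                      # (B, N, K, 3)
    r = d.norm(dim=-1).clamp(min=1e-8)                       # offset lengths
    zenith = torch.acos((d[..., 2] / r).clamp(-1.0, 1.0))    # angle from +z
    azimuth = torch.atan2(d[..., 1], d[..., 0])              # angle in xy-plane
    return zenith, azimuth                                   # each (B, N, K)
```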

Knowledge Distillation from Single to Multi Labels: an Empirical Study

Mar 15, 2023
Youcai Zhang, Yuzhuo Qin, Hengwei Liu, Yanhao Zhang, Yaqian Li, Xiaodong Gu


Knowledge distillation (KD) has been extensively studied in single-label image classification. However, its efficacy for multi-label classification remains relatively unexplored. In this study, we first investigate the effectiveness of classical KD techniques, including logit-based and feature-based methods, for multi-label classification. Our findings indicate that the logit-based method is not well suited for multi-label classification, as the teacher fails to provide inter-category similarity information or a regularization effect on the student model's training. Moreover, we observe that feature-based methods struggle to convey the compact information of multiple labels simultaneously. Given these limitations, we propose that suitable dark knowledge should incorporate class-wise information and be highly correlated with the final classification results. To address these issues, we introduce a novel distillation method based on Class Activation Maps (CAMs), which is both effective and straightforward to implement. Across a wide range of settings, CAMs-based distillation consistently outperforms other methods.
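A minimal sketch of CAM-based distillation as described: build per-class activation maps from each model's final feature map and classifier weights, then match the student's maps to the teacher's. The min-max normalization and MSE distance are illustrative choices, not necessarily the paper's:

```python
import torch
import torch.nn.functional as F

def cam_distill_loss(feat_s, fc_s, feat_t, fc_t):
    """Match student CAMs to teacher CAMs.

    feat_s, feat_t: (B, C_s, H, W) / (B, C_t, H', W') backbone feature maps
    fc_s, fc_t:     (num_classes, C_s) / (num_classes, C_t) classifier weights
    """
    def cams(feat, fc):
        m = torch.einsum('bchw,kc->bkhw', feat, fc)      # per-class maps
        m = m - m.amin(dim=(2, 3), keepdim=True)         # min-max normalize
        return m / m.amax(dim=(2, 3), keepdim=True).clamp(min=1e-8)

    cam_s, cam_t = cams(feat_s, fc_s), cams(feat_t, fc_t)
    if cam_t.shape[-2:] != cam_s.shape[-2:]:             # align spatial sizes
        cam_t = F.interpolate(cam_t, size=cam_s.shape[-2:],
                              mode='bilinear', align_corners=False)
    return F.mse_loss(cam_s, cam_t.detach())             # teacher is frozen
```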


NAF: Neural Attenuation Fields for Sparse-View CBCT Reconstruction

Sep 29, 2022
Ruyi Zha, Yanhao Zhang, Hongdong Li

This paper proposes a novel and fast self-supervised solution for sparse-view CBCT (cone-beam computed tomography) reconstruction that requires no external training data. Specifically, the desired attenuation coefficients are represented as a continuous function of 3D spatial coordinates, parameterized by a fully-connected deep neural network. We synthesize projections discretely and train the network by minimizing the error between real and synthesized projections. A learning-based encoder based on hash coding is adopted to help the network capture high-frequency details. This encoder outperforms the commonly used frequency-domain encoder in both accuracy and efficiency because it exploits the smoothness and sparsity of human organs. Experiments have been conducted on both human-organ and phantom datasets. The proposed method achieves state-of-the-art accuracy within a reasonably short computation time.

* MICCAI 2022 (Oral) 
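The overall training loop is compact enough to sketch. Below, a plain coordinate MLP stands in for the hash-encoded network, and projections are synthesized as discrete line integrals of the predicted attenuation; the geometry, sampling, and network size are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

# A raw-coordinate MLP stands in for the hash-encoded network.
field = torch.nn.Sequential(
    torch.nn.Linear(3, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 1), torch.nn.Softplus(),  # attenuation is non-negative
)

def render_projection(origins, dirs, n_samples=64, near=0.5, far=1.5):
    """Discrete line integral of predicted attenuation along each ray.
    origins, dirs: (R, 3); returns (R,) synthesized projection values."""
    ts = torch.linspace(near, far, n_samples)
    pts = origins[:, None, :] + ts[None, :, None] * dirs[:, None, :]
    mu = field(pts).squeeze(-1)                   # (R, n_samples) attenuations
    return mu.sum(dim=-1) * (far - near) / n_samples

# One training step: match real and synthesized projections.
origins = torch.zeros(128, 3)
dirs = F.normalize(torch.randn(128, 3), dim=-1)
measured = torch.rand(128)                        # stand-in for real projections
loss = F.mse_loss(render_projection(origins, dirs), measured)
loss.backward()
```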

Satellite Image Based Cross-view Localization for Autonomous Vehicle

Jul 27, 2022
Shan Wang, Yanhao Zhang, Hongdong Li


Existing spatial localization techniques for autonomous vehicles mostly use a pre-built 3D-HD map, often constructed with a survey-grade 3D mapping vehicle, which is not only expensive but also laborious. This paper shows that by using an off-the-shelf high-definition satellite image as a ready-to-use map, we are able to achieve cross-view vehicle localization with satisfactory accuracy, providing a cheaper and more practical way to localize. Although the idea of using satellite images for cross-view localization is not new, previous methods almost exclusively treat the task as image retrieval, namely matching a vehicle-captured ground-view image against the satellite image. This paper presents a novel cross-view localization method that departs from the common wisdom of image retrieval. Specifically, our method develops (1) a Geometric-align Feature Extractor (GaFE) that leverages measured 3D points to bridge the geometric gap between the ground view and the overhead view, (2) a Pose Aware Branch (PAB) adopting a triplet loss to encourage pose-aware feature extraction, and (3) a Recursive Pose Refine Branch (RPRB) using the Levenberg-Marquardt (LM) algorithm to iteratively align the initial pose towards the true vehicle pose. Our method is validated on the KITTI and Ford Multi-AV Seasonal datasets for the ground view, with Google Maps imagery as the satellite view. The results demonstrate the superiority of our method in cross-view localization, with spatial and angular errors within 1 meter and $2^\circ$, respectively. The code will be made publicly available.
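The RPRB's Levenberg-Marquardt update is a standard step that is easy to sketch; the residual and Jacobian callables below are generic stand-ins for the paper's feature-metric alignment:

```python
import numpy as np

def lm_step(pose, residual_fn, jacobian_fn, lam=1e-2):
    """One LM update: pose <- pose - (J^T J + lam*I)^(-1) J^T r."""
    r = residual_fn(pose)                          # (M,) residuals
    J = jacobian_fn(pose)                          # (M, P) Jacobian
    H = J.T @ J + lam * np.eye(J.shape[1])
    return pose - np.linalg.solve(H, J.T @ r)

# Toy usage: refine a 3-DoF pose (x, y, yaw) toward a known target.
target = np.array([1.0, 2.0, 0.1])
pose = np.zeros(3)
for _ in range(10):
    pose = lm_step(pose, lambda p: p - target, lambda p: np.eye(3))
print(pose)                                        # approaches the target
```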
