Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yuan Xie

Progressive Contrastive Learning with Multi-Prototype for Unsupervised Visible-Infrared Person Re-identification

Feb 29, 2024

Jiangming Shi, Xiangbo Yin, Yaoxing Wang, Xiaofeng Liu, Yuan Xie, Yanyun Qu

Figure 1 for Progressive Contrastive Learning with Multi-Prototype for Unsupervised Visible-Infrared Person Re-identification

Figure 2 for Progressive Contrastive Learning with Multi-Prototype for Unsupervised Visible-Infrared Person Re-identification

Figure 3 for Progressive Contrastive Learning with Multi-Prototype for Unsupervised Visible-Infrared Person Re-identification

Figure 4 for Progressive Contrastive Learning with Multi-Prototype for Unsupervised Visible-Infrared Person Re-identification

Abstract:Unsupervised visible-infrared person re-identification (USVI-ReID) aims to match specified people in infrared images to visible images without annotation, and vice versa. USVI-ReID is a challenging yet under-explored task. Most existing methods address the USVI-ReID problem using cluster-based contrastive learning, which simply employs the cluster center as a representation of a person. However, the cluster center primarily focuses on shared information, overlooking disparity. To address the problem, we propose a Progressive Contrastive Learning with Multi-Prototype (PCLMP) method for USVI-ReID. In brief, we first generate the hard prototype by selecting the sample with the maximum distance from the cluster center. This hard prototype is used in the contrastive loss to emphasize disparity. Additionally, instead of rigidly aligning query images to a specific prototype, we generate the dynamic prototype by randomly picking samples within a cluster. This dynamic prototype is used to retain the natural variety of features while reducing instability in the simultaneous learning of both common and disparate information. Finally, we introduce a progressive learning strategy to gradually shift the model's attention towards hard samples, avoiding cluster deterioration. Extensive experiments conducted on the publicly available SYSU-MM01 and RegDB datasets validate the effectiveness of the proposed method. PCLMP outperforms the existing state-of-the-art method with an average mAP improvement of 3.9%. The source codes will be released.

Via

Access Paper or Ask Questions

Guiding the underwater acoustic target recognition with interpretable contrastive learning

Feb 20, 2024

Yuan Xie, Jiawei Ren, Ji Xu

Figure 1 for Guiding the underwater acoustic target recognition with interpretable contrastive learning

Figure 2 for Guiding the underwater acoustic target recognition with interpretable contrastive learning

Figure 3 for Guiding the underwater acoustic target recognition with interpretable contrastive learning

Figure 4 for Guiding the underwater acoustic target recognition with interpretable contrastive learning

Abstract:Recognizing underwater targets from acoustic signals is a challenging task owing to the intricate ocean environments and variable underwater channels. While deep learning-based systems have become the mainstream approach for underwater acoustic target recognition, they have faced criticism for their lack of interpretability and weak generalization performance in practical applications. In this work, we apply the class activation mapping (CAM) to generate visual explanations for the predictions of a spectrogram-based recognition system. CAM can help to understand the behavior of recognition models by highlighting the regions of the input features that contribute the most to the prediction. Our explorations reveal that recognition models tend to focus on the low-frequency line spectrum and high-frequency periodic modulation information of underwater signals. Based on the observation, we propose an interpretable contrastive learning (ICL) strategy that employs two encoders to learn from acoustic features with different emphases (line spectrum and modulation information). By imposing constraints between encoders, the proposed strategy can enhance the generalization performance of the recognition system. Our experiments demonstrate that the proposed contrastive learning approach can improve the recognition accuracy and bring significant improvements across various underwater databases.

* OCEANS 2023-Limerick. IEEE, 2023: 1-6

Via

Access Paper or Ask Questions

Unraveling Complex Data Diversity in Underwater Acoustic Target Recognition through Convolution-based Mixture of Experts

Feb 19, 2024

Yuan Xie, Jiawei Ren, Ji Xu

Figure 1 for Unraveling Complex Data Diversity in Underwater Acoustic Target Recognition through Convolution-based Mixture of Experts

Figure 2 for Unraveling Complex Data Diversity in Underwater Acoustic Target Recognition through Convolution-based Mixture of Experts

Figure 3 for Unraveling Complex Data Diversity in Underwater Acoustic Target Recognition through Convolution-based Mixture of Experts

Figure 4 for Unraveling Complex Data Diversity in Underwater Acoustic Target Recognition through Convolution-based Mixture of Experts

Abstract:Underwater acoustic target recognition is a difficult task owing to the intricate nature of underwater acoustic signals. The complex underwater environments, unpredictable transmission channels, and dynamic motion states greatly impact the real-world underwater acoustic signals, and may even obscure the intrinsic characteristics related to targets. Consequently, the data distribution of underwater acoustic signals exhibits high intra-class diversity, thereby compromising the accuracy and robustness of recognition systems.To address these issues, this work proposes a convolution-based mixture of experts (CMoE) that recognizes underwater targets in a fine-grained manner. The proposed technique introduces multiple expert layers as independent learners, along with a routing layer that determines the assignment of experts according to the characteristics of inputs. This design allows the model to utilize independent parameter spaces, facilitating the learning of complex underwater signals with high intra-class diversity. Furthermore, this work optimizes the CMoE structure by balancing regularization and an optional residual module. To validate the efficacy of our proposed techniques, we conducted detailed experiments and visualization analyses on three underwater acoustic databases across several acoustic features. The experimental results demonstrate that our CMoE consistently achieves significant performance improvements, delivering superior recognition accuracy when compared to existing advanced methods.

* Expert Systems with Applications (2024): 123431

Via

Access Paper or Ask Questions

Multi-Memory Matching for Unsupervised Visible-Infrared Person Re-Identification

Jan 12, 2024

Jiangming Shi, Xiangbo Yin, Yeyun Chen, Yachao Zhang, Zhizhong Zhang, Yuan Xie, Yanyun Qu

Figure 1 for Multi-Memory Matching for Unsupervised Visible-Infrared Person Re-Identification

Figure 2 for Multi-Memory Matching for Unsupervised Visible-Infrared Person Re-Identification

Figure 3 for Multi-Memory Matching for Unsupervised Visible-Infrared Person Re-Identification

Figure 4 for Multi-Memory Matching for Unsupervised Visible-Infrared Person Re-Identification

Abstract:Unsupervised visible-infrared person re-identification (USL-VI-ReID) is a promising yet challenging retrieval task. The key challenges in USL-VI-ReID are to effectively generate pseudo-labels and establish pseudo-label correspondences across modalities without relying on any prior annotations. Recently, clustered pseudo-label methods have gained more attention in USL-VI-ReID. However, previous methods fell short of fully exploiting the individual nuances, as they simply utilized a single memory that represented an identity to establish cross-modality correspondences, resulting in ambiguous cross-modality correspondences. To address the problem, we propose a Multi-Memory Matching (MMM) framework for USL-VI-ReID. We first design a Cross-Modality Clustering (CMC) module to generate the pseudo-labels through clustering together both two modality samples. To associate cross-modality clustered pseudo-labels, we design a Multi-Memory Learning and Matching (MMLM) module, ensuring that optimization explicitly focuses on the nuances of individual perspectives and establishes reliable cross-modality correspondences. Finally, we design a Soft Cluster-level Alignment (SCA) module to narrow the modality gap while mitigating the effect of noise pseudo-labels through a soft many-to-many alignment strategy. Extensive experiments on the public SYSU-MM01 and RegDB datasets demonstrate the reliability of the established cross-modality correspondences and the effectiveness of our MMM. The source codes will be released.

Via

Access Paper or Ask Questions

CLIP-guided Federated Learning on Heterogeneous and Long-Tailed Data

Dec 14, 2023

Jiangming Shi, Shanshan Zheng, Xiangbo Yin, Yang Lu, Yuan Xie, Yanyun Qu

Figure 1 for CLIP-guided Federated Learning on Heterogeneous and Long-Tailed Data

Figure 2 for CLIP-guided Federated Learning on Heterogeneous and Long-Tailed Data

Figure 3 for CLIP-guided Federated Learning on Heterogeneous and Long-Tailed Data

Figure 4 for CLIP-guided Federated Learning on Heterogeneous and Long-Tailed Data

Abstract:Federated learning (FL) provides a decentralized machine learning paradigm where a server collaborates with a group of clients to learn a global model without accessing the clients' data. User heterogeneity is a significant challenge for FL, which together with the class-distribution imbalance further enhances the difficulty of FL. Great progress has been made in large vision-language models, such as Contrastive Language-Image Pre-training (CLIP), which paves a new way for image classification and object recognition. Inspired by the success of CLIP on few-shot and zero-shot learning, we use CLIP to optimize the federated learning between server and client models under its vision-language supervision. It is promising to mitigate the user heterogeneity and class-distribution balance due to the powerful cross-modality representation and rich open-vocabulary prior knowledge. In this paper, we propose the CLIP-guided FL (CLIP2FL) method on heterogeneous and long-tailed data. In CLIP2FL, the knowledge of the off-the-shelf CLIP model is transferred to the client-server models, and a bridge is built between the client and server. Specifically, for client-side learning, knowledge distillation is conducted between client models and CLIP to improve the ability of client-side feature representation. For server-side learning, in order to mitigate the heterogeneity and class-distribution imbalance, we generate federated features to retrain the server model. A prototype contrastive learning with the supervision of the text encoder of CLIP is introduced to generate federated features depending on the client-side gradients, and they are used to retrain a balanced server classifier.

* This paper has been accepted by AAAI24

Via

Access Paper or Ask Questions

Beyond the Label Itself: Latent Labels Enhance Semi-supervised Point Cloud Panoptic Segmentation

Dec 13, 2023

Yujun Chen, Xin Tan, Zhizhong Zhang, Yanyun Qu, Yuan Xie

Abstract:As the exorbitant expense of labeling autopilot datasets and the growing trend of utilizing unlabeled data, semi-supervised segmentation on point clouds becomes increasingly imperative. Intuitively, finding out more ``unspoken words'' (i.e., latent instance information) beyond the label itself should be helpful to improve performance. In this paper, we discover two types of latent labels behind the displayed label embedded in LiDAR and image data. First, in the LiDAR Branch, we propose a novel augmentation, Cylinder-Mix, which is able to augment more yet reliable samples for training. Second, in the Image Branch, we propose the Instance Position-scale Learning (IPSL) Module to learn and fuse the information of instance position and scale, which is from a 2D pre-trained detector and a type of latent label obtained from 3D to 2D projection. Finally, the two latent labels are embedded into the multi-modal panoptic segmentation network. The ablation of the IPSL module demonstrates its robust adaptability, and the experiments evaluated on SemanticKITTI and nuScenes demonstrate that our model outperforms the state-of-the-art method, LaserMix.

* 12 pages, 8 figures

Via

Access Paper or Ask Questions

COTR: Compact Occupancy TRansformer for Vision-based 3D Occupancy Prediction

Dec 04, 2023

Qihang Ma, Xin Tan, Yanyun Qu, Lizhuang Ma, Zhizhong Zhang, Yuan Xie

Figure 1 for COTR: Compact Occupancy TRansformer for Vision-based 3D Occupancy Prediction

Figure 2 for COTR: Compact Occupancy TRansformer for Vision-based 3D Occupancy Prediction

Figure 3 for COTR: Compact Occupancy TRansformer for Vision-based 3D Occupancy Prediction

Figure 4 for COTR: Compact Occupancy TRansformer for Vision-based 3D Occupancy Prediction

Abstract:The autonomous driving community has shown significant interest in 3D occupancy prediction, driven by its exceptional geometric perception and general object recognition capabilities. To achieve this, current works try to construct a Tri-Perspective View (TPV) or Occupancy (OCC) representation extending from the Bird-Eye-View perception. However, compressed views like TPV representation lose 3D geometry information while raw and sparse OCC representation requires heavy but reducant computational costs. To address the above limitations, we propose Compact Occupancy TRansformer (COTR), with a geometry-aware occupancy encoder and a semantic-aware group decoder to reconstruct a compact 3D OCC representation. The occupancy encoder first generates a compact geometrical OCC feature through efficient explicit-implicit view transformation. Then, the occupancy decoder further enhances the semantic discriminability of the compact OCC representation by a coarse-to-fine semantic grouping strategy. Empirical experiments show that there are evident performance gains across multiple baselines, e.g., COTR outperforms baselines with a relative improvement of 8%-15%, demonstrating the superiority of our method.

Via

Access Paper or Ask Questions

Multi-modal In-Context Learning Makes an Ego-evolving Scene Text Recognizer

Nov 23, 2023

Zhen Zhao, Jingqun Tang, Chunhui Lin, Binghong Wu, Hao Liu, Zhizhong Zhang, Xin Tan, Can Huang, Yuan Xie

Figure 1 for Multi-modal In-Context Learning Makes an Ego-evolving Scene Text Recognizer

Figure 2 for Multi-modal In-Context Learning Makes an Ego-evolving Scene Text Recognizer

Figure 3 for Multi-modal In-Context Learning Makes an Ego-evolving Scene Text Recognizer

Figure 4 for Multi-modal In-Context Learning Makes an Ego-evolving Scene Text Recognizer

Abstract:Scene text recognition (STR) in the wild frequently encounters challenges when coping with domain variations, font diversity, shape deformations, etc. A straightforward solution is performing model fine-tuning tailored to a specific scenario, but it is computationally intensive and requires multiple model copies for various scenarios. Recent studies indicate that large language models (LLMs) can learn from a few demonstration examples in a training-free manner, termed "In-Context Learning" (ICL). Nevertheless, applying LLMs as a text recognizer is unacceptably resource-consuming. Moreover, our pilot experiments on LLMs show that ICL fails in STR, mainly attributed to the insufficient incorporation of contextual information from diverse samples in the training stage. To this end, we introduce E$^2$STR, a STR model trained with context-rich scene text sequences, where the sequences are generated via our proposed in-context training strategy. E$^2$STR demonstrates that a regular-sized model is sufficient to achieve effective ICL capabilities in STR. Extensive experiments show that E$^2$STR exhibits remarkable training-free adaptation in various scenarios and outperforms even the fine-tuned state-of-the-art approaches on public benchmarks.

Via

Access Paper or Ask Questions

Generalized Category Discovery in Semantic Segmentation

Nov 20, 2023

Zhengyuan Peng, Qijian Tian, Jianqing Xu, Yizhang Jin, Xuequan Lu, Xin Tan, Yuan Xie, Lizhuang Ma

Figure 1 for Generalized Category Discovery in Semantic Segmentation

Figure 2 for Generalized Category Discovery in Semantic Segmentation

Figure 3 for Generalized Category Discovery in Semantic Segmentation

Figure 4 for Generalized Category Discovery in Semantic Segmentation

Abstract:This paper explores a novel setting called Generalized Category Discovery in Semantic Segmentation (GCDSS), aiming to segment unlabeled images given prior knowledge from a labeled set of base classes. The unlabeled images contain pixels of the base class or novel class. In contrast to Novel Category Discovery in Semantic Segmentation (NCDSS), there is no prerequisite for prior knowledge mandating the existence of at least one novel class in each unlabeled image. Besides, we broaden the segmentation scope beyond foreground objects to include the entire image. Existing NCDSS methods rely on the aforementioned priors, making them challenging to truly apply in real-world situations. We propose a straightforward yet effective framework that reinterprets the GCDSS challenge as a task of mask classification. Additionally, we construct a baseline method and introduce the Neighborhood Relations-Guided Mask Clustering Algorithm (NeRG-MaskCA) for mask categorization to address the fragmentation in semantic representation. A benchmark dataset, Cityscapes-GCD, derived from the Cityscapes dataset, is established to evaluate the GCDSS framework. Our method demonstrates the feasibility of the GCDSS problem and the potential for discovering and segmenting novel object classes in unlabeled images. We employ the generated pseudo-labels from our approach as ground truth to supervise the training of other models, thereby enabling them with the ability to segment novel classes. It paves the way for further research in generalized category discovery, broadening the horizons of semantic segmentation and its applications. For details, please visit https://github.com/JethroPeng/GCDSS

Via

Access Paper or Ask Questions

LiDAR-Camera Panoptic Segmentation via Geometry-Consistent and Semantic-Aware Alignment

Aug 11, 2023

Zhiwei Zhang, Zhizhong Zhang, Qian Yu, Ran Yi, Yuan Xie, Lizhuang Ma

Figure 1 for LiDAR-Camera Panoptic Segmentation via Geometry-Consistent and Semantic-Aware Alignment

Figure 2 for LiDAR-Camera Panoptic Segmentation via Geometry-Consistent and Semantic-Aware Alignment

Figure 3 for LiDAR-Camera Panoptic Segmentation via Geometry-Consistent and Semantic-Aware Alignment

Figure 4 for LiDAR-Camera Panoptic Segmentation via Geometry-Consistent and Semantic-Aware Alignment

Abstract:3D panoptic segmentation is a challenging perception task that requires both semantic segmentation and instance segmentation. In this task, we notice that images could provide rich texture, color, and discriminative information, which can complement LiDAR data for evident performance improvement, but their fusion remains a challenging problem. To this end, we propose LCPS, the first LiDAR-Camera Panoptic Segmentation network. In our approach, we conduct LiDAR-Camera fusion in three stages: 1) an Asynchronous Compensation Pixel Alignment (ACPA) module that calibrates the coordinate misalignment caused by asynchronous problems between sensors; 2) a Semantic-Aware Region Alignment (SARA) module that extends the one-to-one point-pixel mapping to one-to-many semantic relations; 3) a Point-to-Voxel feature Propagation (PVP) module that integrates both geometric and semantic fusion information for the entire point cloud. Our fusion strategy improves about 6.9% PQ performance over the LiDAR-only baseline on NuScenes dataset. Extensive quantitative and qualitative experiments further demonstrate the effectiveness of our novel framework. The code will be released at https://github.com/zhangzw12319/lcps.git.

* Accepted as ICCV 2023 paper

Via

Access Paper or Ask Questions