Abstract: Semantic segmentation across arbitrary sensor modalities is challenging because sensor characteristics differ widely, and the traditional practice of configuring a separate model for each modality combination leads to redundant development effort. We address these challenges by introducing a universal arbitrary-modal semantic segmentation framework that unifies segmentation across multiple modalities. Our approach features three key innovations: (1) the Modality-aware CLIP (MA-CLIP), which provides modality-specific scene understanding guidance through LoRA fine-tuning; (2) Modality-aligned Embeddings for capturing fine-grained features; and (3) the Domain-specific Refinement Module (DSRM) for dynamic feature adjustment. Evaluated on five diverse datasets with different complementary modalities (event, thermal, depth, polarization, and light field), our model surpasses specialized multi-modal methods and achieves state-of-the-art performance with an mIoU of 65.03%. The code will be released upon acceptance.
Abstract: In autonomous driving, relying solely on frame-based cameras can lead to inaccuracies caused by factors such as long exposure times, high-speed motion, and challenging lighting conditions. To address these issues, we incorporate a bio-inspired vision sensor, the event camera. Unlike conventional cameras, event cameras capture sparse, asynchronous events, providing a complementary modality that mitigates these challenges. In this work, we propose an energy-aware imitation learning framework for steering prediction that leverages both events and frames. Specifically, we design an Energy-driven Cross-modality Fusion Module (ECFM) and an energy-aware decoder to produce reliable and safe predictions. Extensive experiments on two public real-world datasets, DDD20 and DRFuser, demonstrate that our method outperforms existing state-of-the-art (SOTA) approaches. The code and trained models will be released upon acceptance.
Abstract: Data scarcity and heterogeneity pose significant challenges for personalized federated learning, which appear in existing methods mainly as overfitting and low precision. To overcome these challenges, this paper proposes a multi-layer, multi-fusion strategy framework in which the server uses the network-layer parameters of each client's uploaded model as the basic unit of fusion for information sharing. A new fusion strategy that combines personalized and generic fusion is then proposed, and the layer-number threshold at which each fusion strategy applies is set according to the function of the network layers. Under this mechanism, an L2-norm negative exponential similarity metric is used to calculate the fusion weights of the corresponding feature-extraction layer parameters for each client, improving the efficiency of personalized collaboration on heterogeneous data. Meanwhile, a federated global-optimal-model approximation fusion strategy is adopted in the fully connected layers, and this generic fusion strategy alleviates the overfitting introduced by aggressive personalization. Finally, experimental results show that the proposed method outperforms state-of-the-art methods.
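As a rough illustration of the personalized fusion step, the sketch below computes per-client fusion weights for one feature-extraction layer from a negative exponential of the L2 distance between clients' layer parameters. The abstract does not give the exact formula, so the form exp(-||θ_i - θ_j||_2), the normalization to sum to one, the inclusion of the client itself, and the names `l2_distance`, `fusion_weights`, and `fuse_layer` are all assumptions made for illustration, not the paper's implementation.

```python
import math

def l2_distance(a, b):
    # Euclidean distance between two flattened layer-parameter vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def fusion_weights(layer_params, i):
    # Assumed similarity: exp(-L2 distance) between client i and every client,
    # normalized so the weights for client i sum to 1.
    sims = [math.exp(-l2_distance(layer_params[i], p)) for p in layer_params]
    total = sum(sims)
    return [s / total for s in sims]

def fuse_layer(layer_params, i):
    # Personalized layer for client i: similarity-weighted average of
    # all clients' parameters for this layer.
    w = fusion_weights(layer_params, i)
    dim = len(layer_params[i])
    return [
        sum(w[j] * layer_params[j][k] for j in range(len(layer_params)))
        for k in range(dim)
    ]

# Toy data (hypothetical): clients 0 and 1 are similar, client 2 is far away.
clients = [[1.0, 0.0], [1.1, 0.1], [5.0, 5.0]]
weights = fusion_weights(clients, 0)
fused = fuse_layer(clients, 0)
```

Under these assumptions, clients whose layer parameters are close to client 0 receive larger weights, so the distant client 2 contributes very little to client 0's fused feature-extraction layer.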