Abstract:Image modality is not perfect as it often fails in certain conditions, e.g., night and fast motion. This significantly limits the robustness and versatility of existing multi-modal (i.e., Image+X) semantic segmentation methods when confronting modality absence or failure, as often occurred in real-world applications. Inspired by the open-world learning capability of multi-modal vision-language models (MVLMs), we explore a new direction in learning the modality-agnostic representation via knowledge distillation (KD) from MVLMs. Intuitively, we propose Any2Seg, a novel framework that can achieve robust segmentation from any combination of modalities in any visual conditions. Specifically, we first introduce a novel language-guided semantic correlation distillation (LSCD) module to transfer both inter-modal and intra-modal semantic knowledge in the embedding space from MVLMs, e.g., LanguageBind. This enables us to minimize the modality gap and alleviate semantic ambiguity to combine any modalities in any visual conditions. Then, we introduce a modality-agnostic feature fusion (MFF) module that reweights the multi-modal features based on the inter-modal correlation and selects the fine-grained feature. This way, our Any2Seg finally yields an optimal modality-agnostic representation. Extensive experiments on two benchmarks with four modalities demonstrate that Any2Seg achieves the state-of-the-art under the multi-modal setting (+3.54 mIoU) and excels in the challenging modality-incomplete setting(+19.79 mIoU).
Abstract:Fusing an arbitrary number of modalities is vital for achieving robust multi-modal fusion of semantic segmentation yet remains less explored to date. Recent endeavors regard RGB modality as the center and the others as the auxiliary, yielding an asymmetric architecture with two branches. However, the RGB modality may struggle in certain circumstances, e.g., nighttime, while others, e.g., event data, own their merits; thus, it is imperative for the fusion model to discern robust and fragile modalities, and incorporate the most robust and fragile ones to learn a resilient multi-modal framework. To this end, we propose a novel method, named MAGIC, that can be flexibly paired with various backbones, ranging from compact to high-performance models. Our method comprises two key plug-and-play modules. Firstly, we introduce a multi-modal aggregation module to efficiently process features from multi-modal batches and extract complementary scene information. On top, a unified arbitrary-modal selection module is proposed to utilize the aggregated features as the benchmark to rank the multi-modal features based on the similarity scores. This way, our method can eliminate the dependence on RGB modality and better overcome sensor failures while ensuring the segmentation performance. Under the commonly considered multi-modal setting, our method achieves state-of-the-art performance while reducing the model parameters by 60%. Moreover, our method is superior in the novel modality-agnostic setting, where it outperforms prior arts by a large margin of +19.41% mIoU
Abstract:Event cameras harness advantages such as low latency, high temporal resolution, and high dynamic range (HDR), compared to standard cameras. Due to the distinct imaging paradigm shift, a dominant line of research focuses on event-to-video (E2V) reconstruction to bridge event-based and standard computer vision. However, this task remains challenging due to its inherently ill-posed nature: event cameras only detect the edge and motion information locally. Consequently, the reconstructed videos are often plagued by artifacts and regional blur, primarily caused by the ambiguous semantics of event data. In this paper, we find language naturally conveys abundant semantic information, rendering it stunningly superior in ensuring semantic consistency for E2V reconstruction. Accordingly, we propose a novel framework, called LaSe-E2V, that can achieve semantic-aware high-quality E2V reconstruction from a language-guided perspective, buttressed by the text-conditional diffusion models. However, due to diffusion models' inherent diversity and randomness, it is hardly possible to directly apply them to achieve spatial and temporal consistency for E2V reconstruction. Thus, we first propose an Event-guided Spatiotemporal Attention (ESA) module to condition the event data to the denoising pipeline effectively. We then introduce an event-aware mask loss to ensure temporal coherence and a noise initialization strategy to enhance spatial consistency. Given the absence of event-text-video paired data, we aggregate existing E2V datasets and generate textual descriptions using the tagging models for training and evaluation. Extensive experiments on three datasets covering diverse challenging scenarios (e.g., fast motion, low light) demonstrate the superiority of our method. Dataset and code will be available upon acceptance.
Abstract:Federated Learning (FL) is gaining widespread interest for its ability to share knowledge while preserving privacy and reducing communication costs. Unlike Centralized FL, Decentralized FL (DFL) employs a network architecture that eliminates the need for a central server, allowing direct communication among clients and leading to significant communication resource savings. However, due to data heterogeneity, not all neighboring nodes contribute to enhancing the local client's model performance. In this work, we introduce \textbf{\emph{AFIND+}}, a simple yet efficient algorithm for sampling and aggregating neighbors in DFL, with the aim of leveraging collaboration to improve clients' model performance. AFIND+ identifies helpful neighbors, adaptively adjusts the number of selected neighbors, and strategically aggregates the sampled neighbors' models based on their contributions. Numerical results on real-world datasets with diverse data partitions demonstrate that AFIND+ outperforms other sampling algorithms in DFL and is compatible with most existing DFL optimization algorithms.
Abstract:Recently, electroencephalography (EEG) signals have been actively incorporated to decode brain activity to visual or textual stimuli and achieve object recognition in multi-modal AI. Accordingly, endeavors have been focused on building EEG-based datasets from visual or textual single-modal stimuli. However, these datasets offer limited EEG epochs per category, and the complex semantics of stimuli presented to participants compromise their quality and fidelity in capturing precise brain activity. The study in neuroscience unveils that the relationship between visual and textual stimulus in EEG recordings provides valuable insights into the brain's ability to process and integrate multi-modal information simultaneously. Inspired by this, we propose a novel large-scale multi-modal dataset, named EIT-1M, with over 1 million EEG-image-text pairs. Our dataset is superior in its capacity of reflecting brain activities in simultaneously processing multi-modal information. To achieve this, we collected data pairs while participants viewed alternating sequences of visual-textual stimuli from 60K natural images and category-specific texts. Common semantic categories are also included to elicit better reactions from participants' brains. Meanwhile, response-based stimulus timing and repetition across blocks and sessions are included to ensure data diversity. To verify the effectiveness of EIT-1M, we provide an in-depth analysis of EEG data captured from multi-modal stimuli across different categories and participants, along with data quality scores for transparency. We demonstrate its validity on two tasks: 1) EEG recognition from visual or textual stimuli or both and 2) EEG-to-visual generation.
Abstract:Unsupervised domain adaption (UDA) has emerged as a popular solution to tackle the divergence between the labeled source and unlabeled target domains. Recently, some research efforts have been made to leverage large vision-language models, such as CLIP, and then fine-tune or learn prompts from them for addressing the challenging UDA task. In this work, we shift the gear to a new direction by directly leveraging CLIP to measure the domain divergence and propose a novel language-guided approach for UDA, dubbed as CLIP-Div. Our key idea is to harness CLIP to 1) measure the domain divergence via the acquired domain-agnostic distribution and 2) calibrate the target pseudo labels with language guidance, to effectively reduce the domain gap and improve the UDA model's generalization capability. Specifically, our major technical contribution lies in the proposed two novel language-guided domain divergence measurement losses: absolute divergence and relative divergence. These loss terms furnish precise guidelines for aligning the distributions of the source and target domains with the domain-agnostic distribution derived from CLIP. Additionally, we propose a language-guided pseudo-labeling strategy for calibrating the target pseudo labels. Buttressed by it, we show that a further implementation for self-training can enhance the UDA model's generalization capability on the target domain. CLIP-Div surpasses state-of-the-art CNN-based methods by a substantial margin, achieving a performance boost of +10.3% on Office-Home, +1.5% on Office-31, +0.2% on VisDA-2017, and +24.3% on DomainNet, respectively.
Abstract:The unique complementarity of frame-based and event cameras for high frame rate object tracking has recently inspired some research attempts to develop multi-modal fusion approaches. However, these methods directly fuse both modalities and thus ignore the environmental attributes, e.g., motion blur, illumination variance, occlusion, scale variation, etc. Meanwhile, no interaction between search and template features makes distinguishing target objects and backgrounds difficult. As a result, performance degradation is induced especially in challenging conditions. This paper proposes a novel and effective Transformer-based event-guided tracking framework, called eMoE-Tracker, which achieves new SOTA performance under various conditions. Our key idea is to disentangle the environment into several learnable attributes to dynamically learn the attribute-specific features for better interaction and discriminability between the target information and background. To achieve the goal, we first propose an environmental Mix-of-Experts (eMoE) module that is built upon the environmental Attributes Disentanglement to learn attribute-specific features and environmental Attributes Gating to assemble the attribute-specific features by the learnable attribute scores dynamically. The eMoE module is a subtle router that fine-tunes the transformer backbone more efficiently. We then introduce a contrastive relation modeling (CRM) module to improve interaction and discriminability between the target information and background. Extensive experiments on diverse event-based benchmark datasets showcase the superior performance of our eMoE-Tracker compared to the prior arts.
Abstract:3D object detection is an important task that has been widely applied in autonomous driving. Recently, fusing multi-modal inputs, i.e., LiDAR and camera data, to perform this task has become a new trend. Existing methods, however, either ignore the sparsity of Lidar features or fail to preserve the original spatial structure of LiDAR and the semantic density of camera features simultaneously due to the modality gap. To address issues, this letter proposes a novel bidirectional complementary Lidar-camera fusion framework, called BiCo-Fusion that can achieve robust semantic- and spatial-aware 3D object detection. The key insight is to mutually fuse the multi-modal features to enhance the semantics of LiDAR features and the spatial awareness of the camera features and adaptatively select features from both modalities to build a unified 3D representation. Specifically, we introduce Pre-Fusion consisting of a Voxel Enhancement Module (VEM) to enhance the semantics of voxel features from 2D camera features and Image Enhancement Module (IEM) to enhance the spatial characteristics of camera features from 3D voxel features. Both VEM and IEM are bidirectionally updated to effectively reduce the modality gap. We then introduce Unified Fusion to adaptively weight to select features from the enchanted Lidar and camera features to build a unified 3D representation. Extensive experiments demonstrate the superiority of our BiCo-Fusion against the prior arts. Project page: https://t-ys.github.io/BiCo-Fusion/.
Abstract:Recently, Depth Anything Model (DAM) - a type of depth foundation model - reveals impressive zero-shot capacity for diverse perspective images. Despite its success, it remains an open question regarding DAM's performance on 360 images that enjoy a large field-of-view (180x360) but suffer from spherical distortions. To this end, we establish, to our knowledge, the first benchmark that aims to 1) evaluate the performance of DAM on 360 images and 2) develop a powerful 360 DAM for the benefit of the community. For this, we conduct a large suite of experiments that consider the key properties of 360 images, e.g., different 360 representations, various spatial transformations, and diverse indoor and outdoor scenes. This way, our benchmark unveils some key findings, e.g., DAM is less effective for diverse 360 scenes and sensitive to spatial transformations. To address these challenges, we first collect a large-scale unlabeled dataset including diverse indoor and outdoor scenes. We then propose a semi-supervised learning (SSL) framework to learn a 360 DAM, dubbed Any360D. Under the umbrella of SSL, Any360D first learns a teacher model by fine-tuning DAM via metric depth supervision. Then, we train the student model by uncovering the potential of large-scale unlabeled data with pseudo labels from the teacher model. M\"obius transformation-based spatial augmentation (MTSA) is proposed to impose consistency regularization between the unlabeled data and spatially transformed ones. This subtly improves the student model's robustness to various spatial transformations even under severe distortions. Extensive experiments demonstrate that Any360D outperforms DAM and many prior data-specific models, e.g., PanoFormer, across diverse scenes, showing impressive zero-shot capacity for being a 360 depth foundation model.
Abstract:The advent of large language models (LLMs) has revolutionized the deep learning paradigm, yielding impressive results across a wide array of tasks. However, the pre-training or fine-tuning of LLMs within a federated learning (FL) framework poses substantial challenges, including considerable computational and memory resource demands, as well as communication bottlenecks between servers and clients. Existing solutions either make the unrealistic assumption that the entire model is exchanged for training, or apply parameter-effective fine-tuning methods from centralized learning to train LLMs in FL which tend to underperform during training or fine-tuning stages due to the limited search subspace of parameter updating. In this paper, we introduce a novel method for the efficient training and fine-tuning of LLMs in FL, with minimal resource consumption. Our approach, termed FedCyBGD, utilizes Cycle Block Gradient Descent to periodically update the model. In particular, we design a compression scheme for FedCyBGD, aiming to further decrease the model download cost. It enables full parameter training in FL with only selected block updates and uploads, thereby reducing communication, computation, and memory costs. Our method achieves state-of-the-art performance for FL LLM training, while significantly reducing associated costs. Codes are provided here.