Abstract:RGB-Event tracking improves localization robustness by fusing RGB appearance textures and dense temporal motion cues from event sensors. While this multi-modal scheme broadens tracking applicability, real-world scenes suffer diverse structured signal degradations that hinder traditional multi-modal fusion. In harsh environments, either modality can lose reliability drastically, and targets frequently appear incomplete due to occlusion, edge truncation and foreground clutter.To tackle the above challenges, we present a hierarchical perturbation and retrieval framework tailored for RGB-Event tracking with robustness against partial target missing and modal degradation, termed APRTrack. To mimic real-world signal corruption, APRTrack constructs structured degradation via two adversarial perturbation branches at the modality and spatial levels, which separately simulate full-modal failure and localized target region absence. A hierarchical routing mechanism is designed to disentangle the training pipelines of the two perturbation types, effectively eliminating feature collapse induced by superimposed degradation constraints. Furthermore, we devise Footprint-guided Channel-calibrated Hopfield Retrieval (FCHR) for reliable historical information compensation. This module evaluates retrieval confidence based on association footprints between queries and memory banks, and calibrates the retrieval metric space prior to Hopfield matching, realizing controllable historical feature compensation bounded to target regions. Extensive experiments on FE108, COESOT, VisEvent, and FELT datasets demonstrate the effectiveness of our proposed strategies for the RGB-Event visual object tracking. The source code and pre-trained models will be released on https://github.com/Event-AHU/OpenEvTracking
Abstract:Change detection in optical remote sensing imagery is susceptible to illumination fluctuations, seasonal changes, and variations in surface land-cover materials. Relying solely on RGB imagery often produces pseudo-changes and leads to semantic ambiguity in features. Incorporating near-infrared (NIR) information provides heterogeneous physical cues that are complementary to visible light, thereby enhancing the discriminability of building materials and tiny structures while improving detection accuracy. However, existing multi-modal datasets generally lack high-resolution and accurately registered bi-temporal imagery, and current methods often fail to fully exploit the inherent heterogeneity between these modalities. To address these issues, we introduce the Large-scale Small-change Multi-modal Dataset (LSMD), a bi-temporal RGB-NIR building change detection benchmark dataset targeting small changes in realistic scenarios, providing a rigorous testing platform for evaluating multi-modal change detection methods in complex environments. Based on LSMD, we further propose the Multi-modal Spectral Complementarity Network (MSCNet) to achieve effective cross-modal feature fusion. MSCNet comprises three key components: the Neighborhood Context Enhancement Module (NCEM) to strengthen local spatial details, the Cross-modal Alignment and Interaction Module (CAIM) to enable deep interaction between RGB and NIR features, and the Saliency-aware Multisource Refinement Module (SMRM) to progressively refine fused features. Extensive experiments demonstrate that MSCNet effectively leverages multi-modal information and consistently outperforms existing methods under multiple input configurations, validating its efficacy for fine-grained building change detection. The source code will be made publicly available at: https://github.com/AeroVILab-AHU/LSMD
Abstract:Current remote sensing change detection (CD) methods mainly rely on specialized models, which limits the scalability toward modality-adaptive Earth observation. For homogeneous CD, precise boundary delineation relies on fine-grained spatial cues and local pixel interactions, whereas heterogeneous CD instead requires broader contextual information to suppress speckle noise and geometric distortions. Moreover, difference operator (e.g., subtraction) works well for aligned homogeneous images but introduces artifacts in cross-modal or geometrically misaligned scenarios. Across different modality settings, specialized models based on static backbones or fixed difference operations often prove insufficient. To address this challenge, we propose UniRoute, a unified framework for modality-adaptive learning by reformulating feature extraction and fusion as conditional routing problems. We introduce an Adaptive Receptive Field Routing MoE (AR2-MoE) module to disentangle local spatial details from global semantic context, and a Modality-Aware Difference Routing MoE (MDR-MoE) module to adaptively select the most suitable fusion primitive at each pixel. In addition, we propose a Consistency-Aware Self-Distillation (CASD) strategy that stabilizes unified training under data-scarce heterogeneous settings by enforcing multi-level consistency. Extensive experiments on five public datasets demonstrate that UniRoute achieves strong overall performance, with a favorable accuracy-efficiency trade-off under a unified deployment setting.




Abstract:Accurate detection of changes in roads and bridges, such as construction, renovation, and demolition, is essential for urban planning and traffic management. However, existing methods often struggle to extract fine-grained semantic change information due to the lack of high-quality annotated datasets in traffic scenarios. To address this, we introduce the Road and Bridge Semantic Change Detection (RB-SCD) dataset, a comprehensive benchmark comprising 260 pairs of high-resolution remote sensing images from diverse cities and countries. RB-SCD captures 11 types of semantic changes across varied road and bridge structures, enabling detailed structural and functional analysis. Building on this dataset, we propose a novel framework, Multimodal Frequency-Driven Change Detector (MFDCD), which integrates multimodal features in the frequency domain. MFDCD includes a Dynamic Frequency Coupler (DFC) that fuses hierarchical visual features with wavelet-based frequency components, and a Textual Frequency Filter (TFF) that transforms CLIP-derived textual features into the frequency domain and applies graph-based filtering. Experimental results on RB-SCD and three public benchmarks demonstrate the effectiveness of our approach.