Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Huiyu Zhou

Vision Mamba in Remote Sensing: A Comprehensive Survey of Techniques, Applications and Outlook

May 01, 2025

Muyi Bao, Shuchang Lyu, Zhaoyang Xu, Huiyu Zhou, Jinchang Ren, Shiming Xiang, Xiangtai Li, Guangliang Cheng

Abstract:Deep learning has profoundly transformed remote sensing, yet prevailing architectures like Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) remain constrained by critical trade-offs: CNNs suffer from limited receptive fields, while ViTs grapple with quadratic computational complexity, hindering their scalability for high-resolution remote sensing data. State Space Models (SSMs), particularly the recently proposed Mamba architecture, have emerged as a paradigm-shifting solution, combining linear computational scaling with global context modeling. This survey presents a comprehensive review of Mamba-based methodologies in remote sensing, systematically analyzing about 120 studies to construct a holistic taxonomy of innovations and applications. Our contributions are structured across five dimensions: (i) foundational principles of vision Mamba architectures, (ii) micro-architectural advancements such as adaptive scan strategies and hybrid SSM formulations, (iii) macro-architectural integrations, including CNN-Transformer-Mamba hybrids and frequency-domain adaptations, (iv) rigorous benchmarking against state-of-the-art methods in multiple application tasks, such as object detection, semantic segmentation, change detection, etc. and (v) critical analysis of unresolved challenges with actionable future directions. By bridging the gap between SSM theory and remote sensing practice, this survey establishes Mamba as a transformative framework for remote sensing analysis. To our knowledge, this paper is the first systematic review of Mamba architectures in remote sensing. Our work provides a structured foundation for advancing research in remote sensing systems through SSM-based methods. We curate an open-source repository (https://github.com/BaoBao0926/Awesome-Mamba-in-Remote-Sensing) to foster community-driven advancements.

Via

Access Paper or Ask Questions

Energy-Based Pseudo-Label Refining for Source-free Domain Adaptation

Apr 23, 2025

Xinru Meng, Han Sun, Jiamei Liu, Ningzhong Liu, Huiyu Zhou

Figure 1 for Energy-Based Pseudo-Label Refining for Source-free Domain Adaptation

Figure 2 for Energy-Based Pseudo-Label Refining for Source-free Domain Adaptation

Figure 3 for Energy-Based Pseudo-Label Refining for Source-free Domain Adaptation

Figure 4 for Energy-Based Pseudo-Label Refining for Source-free Domain Adaptation

Abstract:Source-free domain adaptation (SFDA), which involves adapting models without access to source data, is both demanding and challenging. Existing SFDA techniques typically rely on pseudo-labels generated from confidence levels, leading to negative transfer due to significant noise. To tackle this problem, Energy-Based Pseudo-Label Refining (EBPR) is proposed for SFDA. Pseudo-labels are created for all sample clusters according to their energy scores. Global and class energy thresholds are computed to selectively filter pseudo-labels. Furthermore, a contrastive learning strategy is introduced to filter difficult samples, aligning them with their augmented versions to learn more discriminative features. Our method is validated on the Office-31, Office-Home, and VisDA-C datasets, consistently finding that our model outperformed state-of-the-art methods.

* 8 pages, 3 figures, accepted by PRL. code at https://github.com/Sthen111/EBPR

Via

Access Paper or Ask Questions

From Missing Pieces to Masterpieces: Image Completion with Context-Adaptive Diffusion

Apr 19, 2025

Pourya Shamsolmoali, Masoumeh Zareapoor, Huiyu Zhou, Michael Felsberg, Dacheng Tao, Xuelong Li

Abstract:Image completion is a challenging task, particularly when ensuring that generated content seamlessly integrates with existing parts of an image. While recent diffusion models have shown promise, they often struggle with maintaining coherence between known and unknown (missing) regions. This issue arises from the lack of explicit spatial and semantic alignment during the diffusion process, resulting in content that does not smoothly integrate with the original image. Additionally, diffusion models typically rely on global learned distributions rather than localized features, leading to inconsistencies between the generated and existing image parts. In this work, we propose ConFill, a novel framework that introduces a Context-Adaptive Discrepancy (CAD) model to ensure that intermediate distributions of known and unknown regions are closely aligned throughout the diffusion process. By incorporating CAD, our model progressively reduces discrepancies between generated and original images at each diffusion step, leading to contextually aligned completion. Moreover, ConFill uses a new Dynamic Sampling mechanism that adaptively increases the sampling rate in regions with high reconstruction complexity. This approach enables precise adjustments, enhancing detail and integration in restored areas. Extensive experiments demonstrate that ConFill outperforms current methods, setting a new benchmark in image completion.

* Accepted in TPAMI

Via

Access Paper or Ask Questions

k-fold Subsampling based Sequential Backward Feature Elimination

Mar 14, 2025

Jeonghwan Park, Kang Li, Huiyu Zhou

Figure 1 for k-fold Subsampling based Sequential Backward Feature Elimination

Figure 2 for k-fold Subsampling based Sequential Backward Feature Elimination

Figure 3 for k-fold Subsampling based Sequential Backward Feature Elimination

Figure 4 for k-fold Subsampling based Sequential Backward Feature Elimination

Abstract:We present a new wrapper feature selection algorithm for human detection. This algorithm is a hybrid feature selection approach combining the benefits of filter and wrapper methods. It allows the selection of an optimal feature vector that well represents the shapes of the subjects in the images. In detail, the proposed feature selection algorithm adopts the k-fold subsampling and sequential backward elimination approach, while the standard linear support vector machine (SVM) is used as the classifier for human detection. We apply the proposed algorithm to the publicly accessible INRIA and ETH pedestrian full image datasets with the PASCAL VOC evaluation criteria. Compared to other state of the arts algorithms, our feature selection based approach can improve the detection speed of the SVM classifier by over 50% with up to 2% better detection accuracy. Our algorithm also outperforms the equivalent systems introduced in the deformable part model approach with around 9% improvement in the detection accuracy.

* 8 pages

Via

Access Paper or Ask Questions

Joint Beamforming and Compressed Sensing for Uplink Grant-Free Access

Mar 09, 2025

Guoqing Xia, Pei Xiao, Bohan Li, Yue Zhang, Huiyu Zhou

Figure 1 for Joint Beamforming and Compressed Sensing for Uplink Grant-Free Access

Figure 2 for Joint Beamforming and Compressed Sensing for Uplink Grant-Free Access

Figure 3 for Joint Beamforming and Compressed Sensing for Uplink Grant-Free Access

Figure 4 for Joint Beamforming and Compressed Sensing for Uplink Grant-Free Access

Abstract:Compressed sensing (CS)-based techniques have been widely applied in the grant-free non-orthogonal multiple access (NOMA) to a single-antenna base station (BS). In this paper, we consider the multi-antenna reception at the BS for uplink grant-free access for the massive machine type communication (mMTC) with limited channel resources. To enhance the overloading performance of the BS, we develop a general framework for the synergistic amalgamation of the spatial division multiple access (SDMA) technique with the CS-based grant-free NOMA. We derive a closed-form statistical beamforming and a dynamic beamforming scheme for the inter-cluster interference suppression when applying SDMA. Based on this, we further develop a joint adaptive beamforming and subspace pursuit (JABF-SP) algorithm for the multiuser detection and data recovery, with a novel sparsity level decision method without the accurate knowledge of the noise level. To further improve the data recovery performance, we propose an interference cancellation based J-ABF-SP scheme (J-ABF-SP-IC) by using the initial signal estimates generated from the J-ABF-SP algorithm. Illustrative simulations verify the superior user detection and signal recovery performance of our proposed algorithms in comparison with existing CS-based grant-free NOMA techniques.

* 17 pages, 17 figures

Via

Access Paper or Ask Questions

Cell-Free MIMO-ISAC: Joint Location and Velocity Estimation and Fundamental CRLB Analysis

Mar 09, 2025

Guoqing Xia, Pei Xiao, Qu Luo, Bing Ji, Yue Zhang, Huiyu Zhou

Abstract:This paper investigates joint location and velocity estimation, along with their fundamental performance bounds analysis, in a cell-free multi-input multi-output (MIMO) integrated sensing and communication (ISAC) system. First, unlike existing studies that derive likelihood functions for target parameter estimation using continuous received signals, we formulate the maximum likelihood estimation (MLE) for radar sensing based on discrete received signals at a given sampling rate. Second, leveraging the proposed MLEs, we derive closed-form Cramer-Rao lower bounds (CRLBs) for joint location and velocity estimation in both single-target and multiple-target scenarios. Third, to enhance computational efficiency, we propose approximate CRLBs and conduct an in-depth accuracy analysis. Additionally, we thoroughly examine the impact of sampling rate, squared effective bandwidth, and time width on CRLB performance. For multiple-target scenarios, the concepts of safety distance and safety velocity are introduced to characterize conditions under which the CRLBs for multiple targets converge to their single target counterparts. Finally, extensive simulations are conducted to verify the accuracy of the proposed CRLBs and the theoretical results using state-of-the-art waveforms, namely orthogonal frequency division multiplexing (OFDM) and orthogonal chirp division multiplexing (OCDM).

* 17 pages, 13 figures

Via

Access Paper or Ask Questions

Fractional Correspondence Framework in Detection Transformer

Mar 06, 2025

Masoumeh Zareapoor, Pourya Shamsolmoali, Huiyu Zhou, Yue Lu, Salvador García

Figure 1 for Fractional Correspondence Framework in Detection Transformer

Figure 2 for Fractional Correspondence Framework in Detection Transformer

Figure 3 for Fractional Correspondence Framework in Detection Transformer

Figure 4 for Fractional Correspondence Framework in Detection Transformer

Abstract:The Detection Transformer (DETR), by incorporating the Hungarian algorithm, has significantly simplified the matching process in object detection tasks. This algorithm facilitates optimal one-to-one matching of predicted bounding boxes to ground-truth annotations during training. While effective, this strict matching process does not inherently account for the varying densities and distributions of objects, leading to suboptimal correspondences such as failing to handle multiple detections of the same object or missing small objects. To address this, we propose the Regularized Transport Plan (RTP). RTP introduces a flexible matching strategy that captures the cost of aligning predictions with ground truths to find the most accurate correspondences between these sets. By utilizing the differentiable Sinkhorn algorithm, RTP allows for soft, fractional matching rather than strict one-to-one assignments. This approach enhances the model's capability to manage varying object densities and distributions effectively. Our extensive evaluations on the MS-COCO and VOC benchmarks demonstrate the effectiveness of our approach. RTP-DETR, surpassing the performance of the Deform-DETR and the recently introduced DINO-DETR, achieving absolute gains in mAP of +3.8% and +1.7%, respectively.

* ACMMM-2024

Via

Access Paper or Ask Questions

SleepGMUformer: A gated multimodal temporal neural network for sleep staging

Feb 20, 2025

Chenjun Zhao, Xuesen Niu, Xinglin Yu, Long Chen, Na Lv, Huiyu Zhou, Aite Zhao

Figure 1 for SleepGMUformer: A gated multimodal temporal neural network for sleep staging

Figure 2 for SleepGMUformer: A gated multimodal temporal neural network for sleep staging

Figure 3 for SleepGMUformer: A gated multimodal temporal neural network for sleep staging

Figure 4 for SleepGMUformer: A gated multimodal temporal neural network for sleep staging

Abstract:Sleep staging is a key method for assessing sleep quality and diagnosing sleep disorders. However, current deep learning methods face challenges: 1) postfusion techniques ignore the varying contributions of different modalities; 2) unprocessed sleep data can interfere with frequency-domain information. To tackle these issues, this paper proposes a gated multimodal temporal neural network for multidomain sleep data, including heart rate, motion, steps, EEG (Fpz-Cz, Pz-Oz), and EOG from WristHR-Motion-Sleep and SleepEDF-78. The model integrates: 1) a pre-processing module for feature alignment, missing value handling, and EEG de-trending; 2) a feature extraction module for complex sleep features in the time dimension; and 3) a dynamic fusion module for real-time modality weighting.Experiments show classification accuracies of 85.03% on SleepEDF-78 and 94.54% on WristHR-Motion-Sleep datasets. The model handles heterogeneous datasets and outperforms state-of-the-art models by 1.00%-4.00%.

Via

Access Paper or Ask Questions

EssayJudge: A Multi-Granular Benchmark for Assessing Automated Essay Scoring Capabilities of Multimodal Large Language Models

Feb 17, 2025

Jiamin Su, Yibo Yan, Fangteng Fu, Han Zhang, Jingheng Ye, Xiang Liu, Jiahao Huo, Huiyu Zhou, Xuming Hu

Abstract:Automated Essay Scoring (AES) plays a crucial role in educational assessment by providing scalable and consistent evaluations of writing tasks. However, traditional AES systems face three major challenges: (1) reliance on handcrafted features that limit generalizability, (2) difficulty in capturing fine-grained traits like coherence and argumentation, and (3) inability to handle multimodal contexts. In the era of Multimodal Large Language Models (MLLMs), we propose EssayJudge, the first multimodal benchmark to evaluate AES capabilities across lexical-, sentence-, and discourse-level traits. By leveraging MLLMs' strengths in trait-specific scoring and multimodal context understanding, EssayJudge aims to offer precise, context-rich evaluations without manual feature engineering, addressing longstanding AES limitations. Our experiments with 18 representative MLLMs reveal gaps in AES performance compared to human evaluation, particularly in discourse-level traits, highlighting the need for further advancements in MLLM-based AES research. Our dataset and code will be available upon acceptance.

* JS and YY are co-first authors. XH is the corresponding author

Via

Access Paper or Ask Questions

RealRAG: Retrieval-augmented Realistic Image Generation via Self-reflective Contrastive Learning

Feb 02, 2025

Yuanhuiyi Lyu, Xu Zheng, Lutao Jiang, Yibo Yan, Xin Zou, Huiyu Zhou, Linfeng Zhang, Xuming Hu

Figure 1 for RealRAG: Retrieval-augmented Realistic Image Generation via Self-reflective Contrastive Learning

Figure 2 for RealRAG: Retrieval-augmented Realistic Image Generation via Self-reflective Contrastive Learning

Figure 3 for RealRAG: Retrieval-augmented Realistic Image Generation via Self-reflective Contrastive Learning

Figure 4 for RealRAG: Retrieval-augmented Realistic Image Generation via Self-reflective Contrastive Learning

Abstract:Recent text-to-image generative models, e.g., Stable Diffusion V3 and Flux, have achieved notable progress. However, these models are strongly restricted to their limited knowledge, a.k.a., their own fixed parameters, that are trained with closed datasets. This leads to significant hallucinations or distortions when facing fine-grained and unseen novel real-world objects, e.g., the appearance of the Tesla Cybertruck. To this end, we present the first real-object-based retrieval-augmented generation framework (RealRAG), which augments fine-grained and unseen novel object generation by learning and retrieving real-world images to overcome the knowledge gaps of generative models. Specifically, to integrate missing memory for unseen novel object generation, we train a reflective retriever by self-reflective contrastive learning, which injects the generator's knowledge into the sef-reflective negatives, ensuring that the retrieved augmented images compensate for the model's missing knowledge. Furthermore, the real-object-based framework integrates fine-grained visual knowledge for the generative models, tackling the distortion problem and improving the realism for fine-grained object generation. Our Real-RAG is superior in its modular application to all types of state-of-the-art text-to-image generative models and also delivers remarkable performance boosts with all of them, such as a gain of 16.18% FID score with the auto-regressive model on the Stanford Car benchmark.

Via

Access Paper or Ask Questions