Jun Li

Michael

SAM-Based Building Change Detection with Distribution-Aware Fourier Adaptation and Edge-Constrained Warping

Apr 17, 2025

Yun-Cheng Li, Sen Lei, Yi-Tao Zhao, Heng-Chao Li, Jun Li, Antonio Plaza

Abstract:Building change detection remains challenging for urban development, disaster assessment, and military reconnaissance. While foundation models like Segment Anything Model (SAM) show strong segmentation capabilities, SAM is limited in the task of building change detection due to domain gap issues. Existing adapter-based fine-tuning approaches face challenges with imbalanced building distribution, resulting in poor detection of subtle changes and inaccurate edge extraction. Additionally, bi-temporal misalignment in change detection, typically addressed by optical flow, remains vulnerable to background noise. This affects the detection of building changes and compromises both detection accuracy and edge recognition. To tackle these challenges, we propose a new SAM-Based Network with Distribution-Aware Fourier Adaptation and Edge-Constrained Warping (FAEWNet) for building change detection. FAEWNet utilizes the SAM encoder to extract rich visual features from remote sensing images. To guide SAM in focusing on specific ground objects in remote sensing scenes, we propose a Distribution-Aware Fourier Aggregated Adapter to aggregate task-oriented changed information. This adapter not only effectively addresses the domain gap issue, but also pays attention to the distribution of changed buildings. Furthermore, to mitigate noise interference and misalignment in height offset estimation, we design a novel flow module that refines building edge extraction and enhances the perception of changed buildings. Our state-of-the-art results on the LEVIR-CD, S2Looking and WHU-CD datasets highlight the effectiveness of FAEWNet. The code is available at https://github.com/SUPERMAN123000/FAEWNet.

AeroLite: Tag-Guided Lightweight Generation of Aerial Image Captions

Apr 13, 2025

Xing Zi, Tengjun Ni, Xianjing Fan, Xian Tao, Jun Li, Ali Braytee, Mukesh Prasad

Abstract:Accurate and automated captioning of aerial imagery is crucial for applications like environmental monitoring, urban planning, and disaster management. However, this task remains challenging due to complex spatial semantics and domain variability. To address these issues, we introduce AeroLite, a lightweight, tag-guided captioning framework designed to equip small-scale language models (1-3B parameters) with robust and interpretable captioning capabilities specifically for remote sensing images. AeroLite leverages GPT-4o to generate a large-scale, semantically rich pseudo-caption dataset by integrating multiple remote sensing benchmarks, including DLRSD, iSAID, LoveDA, WHU, and RSSCN7. To explicitly capture key semantic elements such as orientation and land-use types, AeroLite employs natural language processing techniques to extract relevant semantic tags. These tags are then learned by a dedicated multi-label CLIP encoder, ensuring precise semantic predictions. To effectively fuse visual and semantic information, we propose a novel bridging multilayer perceptron (MLP) architecture, aligning semantic tags with visual embeddings while maintaining minimal computational overhead. AeroLite's flexible design also enables seamless integration with various pretrained large language models. We adopt a two-stage LoRA-based training approach: the initial stage leverages our pseudo-caption dataset to capture broad remote sensing semantics, followed by fine-tuning on smaller, curated datasets like UCM and Sydney Captions to refine domain-specific alignment. Experimental evaluations demonstrate that AeroLite surpasses significantly larger models (e.g., 13B parameters) in standard captioning metrics, including BLEU and METEOR, while maintaining substantially lower computational costs.

AutoSSVH: Exploring Automated Frame Sampling for Efficient Self-Supervised Video Hashing

Apr 04, 2025

Niu Lian, Jun Li, Jinpeng Wang, Ruisheng Luo, Yaowei Wang, Shu-Tao Xia, Bin Chen

Abstract:Self-Supervised Video Hashing (SSVH) compresses videos into hash codes for efficient indexing and retrieval using unlabeled training videos. Existing approaches rely on random frame sampling to learn video features and treat all frames equally. This results in suboptimal hash codes, as it ignores frame-specific information density and reconstruction difficulty. To address this limitation, we propose a new framework, termed AutoSSVH, that employs adversarial frame sampling with hash-based contrastive learning. Our adversarial sampling strategy automatically identifies and selects challenging frames with richer information for reconstruction, enhancing encoding capability. Additionally, we introduce a hash component voting strategy and a point-to-set (P2Set) hash-based contrastive objective, which help capture complex inter-video semantic relationships in the Hamming space and improve the discriminability of learned hash codes. Extensive experiments demonstrate that AutoSSVH achieves superior retrieval efficacy and efficiency compared to state-of-the-art approaches. Code is available at https://github.com/EliSpectre/CVPR25-AutoSSVH.

* Accepted by CVPR'25. 11 pages, 5 figures, 3 tables
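The retrieval step that hash codes enable can be illustrated with a minimal sketch. It is not the AutoSSVH method: the sign-threshold `binarize` function, the helper names, and the toy feature values are all assumptions made for illustration. It only shows how binary codes are compared and ranked in Hamming space.

```python
from typing import List, Sequence


def binarize(features: Sequence[float]) -> List[int]:
    """Map real-valued features to a binary hash code by thresholding at zero."""
    return [1 if value > 0 else 0 for value in features]


def hamming_distance(code_a: Sequence[int], code_b: Sequence[int]) -> int:
    """Count the positions at which two equal-length hash codes differ."""
    if len(code_a) != len(code_b):
        raise ValueError("hash codes must have the same length")
    return sum(bit_a != bit_b for bit_a, bit_b in zip(code_a, code_b))


def rank_by_hamming(query: Sequence[int], database: Sequence[Sequence[int]]) -> List[int]:
    """Return database indices sorted from nearest to farthest in Hamming space."""
    return sorted(range(len(database)), key=lambda i: hamming_distance(query, database[i]))


# Toy real-valued video features (made-up numbers, for illustration only).
video_features = [
    [0.9, -0.2, 0.4, -0.7],
    [-0.5, 0.3, -0.1, 0.8],
    [0.7, -0.6, 0.2, 0.1],
]
database_codes = [binarize(features) for features in video_features]
query_code = binarize([0.8, -0.1, 0.5, -0.3])
ranking = rank_by_hamming(query_code, database_codes)
print(ranking)
```

Because the distance is a count of differing bits, lookups of this kind stay cheap even for large collections, which is the efficiency argument behind hashing-based retrieval.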

DuckSegmentation: A segmentation model based on the AnYue Hemp Duck Dataset

Mar 27, 2025

Ling Feng, Tianyu Xie, Wei Ma, Ruijie Fu, Yingxiao Zhang, Jun Li, Bei Zhou

Abstract:The modernization of smart farming is a way to improve both agricultural production efficiency and the agricultural production environment. Although many large models have achieved high accuracy in object recognition and segmentation, they cannot readily be put into use in the farming industry because of their poor interpretability and heavy computational requirements. In this paper, we build the AnYue Shelduck Dataset, which contains a total of 1951 Shelduck samples, with target detection and segmentation annotations produced with the help of professional annotators. Based on the AnYue Shelduck Dataset, this paper describes DuckProcessing, an efficient and powerful module for duck identification on real shelduck farms. First, a YOLOv8 module is used to separate the individual ducks from one another, reaching a precision of 98.10%, a recall of 96.53%, and an F1 score of 0.95 on the test set. Next, the DuckSegmentation model reaches 96.43% mIoU. Finally, DuckSegmentation is used as the teacher model and Deeplabv3 r50 as the student model; through knowledge distillation, the final student model achieves 94.49% mIoU on the test set. The method provides a new way of thinking about practical hemp duck smart farming.
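As a quick check on how the detection metrics relate, the sketch below computes F1 as the harmonic mean of the precision and recall quoted in the abstract. The function name is an assumption, and the abstract does not say how its F1 score was averaged. A single harmonic mean of the two quoted values comes out at about 0.973, so the reported 0.95 was presumably computed differently (for example, averaged per class or per image); that reading is a guess, not something the abstract states.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as quoted in the abstract, converted from percentages.
reported_precision = 0.9810
reported_recall = 0.9653

f1_from_reported = f1_score(reported_precision, reported_recall)
print(f"F1 from quoted precision and recall: {f1_from_reported:.4f}")
```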

Invertible Koopman neural operator for data-driven modeling of partial differential equations

Mar 25, 2025

Yuhong Jin, Andong Cong, Lei Hou, Qiang Gao, Xiangdong Ge, Chonglong Zhu, Yongzhi Feng, Jun Li

Abstract:Koopman operator theory is a popular candidate for data-driven modeling because it provides a global linearization representation for nonlinear dynamical systems. However, existing Koopman operator-based methods suffer from shortcomings in constructing a well-behaved observable function and its inverse, and are inefficient when dealing with partial differential equations (PDEs). To address these issues, this paper proposes the Invertible Koopman Neural Operator (IKNO), a novel data-driven modeling approach inspired by the Koopman operator theory and neural operator. IKNO leverages an Invertible Neural Network to parameterize the observable function and its inverse simultaneously under the same learnable parameters, explicitly guaranteeing the reconstruction relation and thus eliminating the dependency on the reconstruction loss, which is an essential improvement over the original Koopman Neural Operator (KNO). The structured linear matrix inspired by the Koopman operator theory is parameterized to learn the evolution of observables' low-frequency modes in the frequency space rather than directly in the observable space, ensuring that IKNO is resolution-invariant like other neural operators. Moreover, with preprocessing such as interpolation and dimension expansion, IKNO can be extended to operator learning tasks defined on non-Cartesian domains. We fully support the above claims based on rich numerical and real-world examples and demonstrate the effectiveness of IKNO and superiority over other neural operators.

* 25 pages, 10 figures
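To make concrete why an invertible network removes the need for a reconstruction loss, here is a minimal sketch using an additive coupling layer, a standard invertible building block. It is an illustration under assumptions, not the IKNO architecture: `shift_network` is a fixed stand-in for a learned network, all names are hypothetical, and the linear step is applied directly in observable space rather than on low-frequency modes as the abstract describes.

```python
import math
from typing import Callable, List

Vector = List[float]
Matrix = List[List[float]]


def shift_network(half: Vector) -> Vector:
    """Stand-in for a learned network; it does not need to be invertible itself."""
    return [math.tanh(2.0 * value) + 0.5 * value * value for value in half]


def observable(x: Vector, shift: Callable[[Vector], Vector] = shift_network) -> Vector:
    """Additive coupling: keep the first half, shift the second half by shift(first half)."""
    if len(x) % 2 != 0:
        raise ValueError("this sketch expects an even-length state")
    mid = len(x) // 2
    first, second = x[:mid], x[mid:]
    return first + [s + t for s, t in zip(second, shift(first))]


def observable_inverse(z: Vector, shift: Callable[[Vector], Vector] = shift_network) -> Vector:
    """Exact inverse of `observable`, built from the same shift function."""
    if len(z) % 2 != 0:
        raise ValueError("this sketch expects an even-length observable")
    mid = len(z) // 2
    first, second = z[:mid], z[mid:]
    return first + [s - t for s, t in zip(second, shift(first))]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def predict_next(x: Vector, koopman_matrix: Matrix) -> Vector:
    """Lift the state, advance it linearly, then map back with the exact inverse."""
    return observable_inverse(matvec(koopman_matrix, observable(x)))


state = [0.3, -1.2, 0.7, 2.0]
reconstructed = observable_inverse(observable(state))
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
unchanged = predict_next(state, identity)
```

Since the forward and inverse maps share the same shift function, the round trip recovers the state up to floating-point error by construction, without any training signal asking for it.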

TraF-Align: Trajectory-aware Feature Alignment for Asynchronous Multi-agent Perception

Mar 25, 2025

Zhiying Song, Lei Yang, Fuxi Wen, Jun Li

Abstract:Cooperative perception presents significant potential for enhancing the sensing capabilities of individual vehicles; however, inter-agent latency remains a critical challenge. Latencies cause misalignments in both spatial and semantic features, complicating the fusion of real-time observations from the ego vehicle with delayed data from others. To address these issues, we propose TraF-Align, a novel framework that learns the flow path of features by predicting the feature-level trajectory of objects from past observations up to the ego vehicle's current time. By generating temporally ordered sampling points along these paths, TraF-Align directs attention from the current-time query to relevant historical features along each trajectory, supporting the reconstruction of current-time features and promoting semantic interaction across multiple frames. This approach corrects spatial misalignment and ensures semantic consistency across agents, effectively compensating for motion and achieving coherent feature fusion. Experiments on two real-world datasets, V2V4Real and DAIR-V2X-Seq, show that TraF-Align sets a new benchmark for asynchronous cooperative perception.

* Accepted to CVPR 2025

Recover and Match: Open-Vocabulary Multi-Label Recognition through Knowledge-Constrained Optimal Transport

Mar 19, 2025

Hao Tan, Zichang Tan, Jun Li, Ajian Liu, Jun Wan, Zhen Lei

Abstract:Identifying multiple novel classes in an image, known as open-vocabulary multi-label recognition, is a challenging task in computer vision. Recent studies explore the transfer of powerful vision-language models such as CLIP. However, these approaches face two critical challenges: (1) The local semantics of CLIP are disrupted due to its global pre-training objectives, resulting in unreliable regional predictions. (2) The matching property between image regions and candidate labels has been neglected, relying instead on naive feature aggregation such as average pooling, which leads to spurious predictions from irrelevant regions. In this paper, we present RAM (Recover And Match), a novel framework that effectively addresses the above issues. To tackle the first problem, we propose Ladder Local Adapter (LLA) to enforce refocusing on local regions, recovering local semantics in a memory-friendly way. For the second issue, we propose Knowledge-Constrained Optimal Transport (KCOT) to suppress meaningless matching to non-GT labels by formulating the task as an optimal transport problem. As a result, RAM achieves state-of-the-art performance on various datasets from three distinct domains, and shows great potential to boost the existing methods. Code: https://github.com/EricTan7/RAM.

* CVPR 2025
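For readers unfamiliar with casting region-to-label matching as optimal transport, the sketch below runs generic entropy-regularized Sinkhorn iterations on a toy cost matrix. It is not KCOT: the knowledge constraints from the abstract are not modeled, and the cost values, marginals, `epsilon`, and function name are assumptions chosen only to show the mechanics.

```python
import math
from typing import List

Matrix = List[List[float]]


def sinkhorn(cost: Matrix, row_mass: List[float], col_mass: List[float],
             epsilon: float = 0.5, iterations: int = 200) -> Matrix:
    """Entropy-regularized optimal transport plan via Sinkhorn scaling."""
    n_rows, n_cols = len(row_mass), len(col_mass)
    # Gibbs kernel: low cost means high affinity.
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    u = [1.0] * n_rows
    v = [1.0] * n_cols
    for _ in range(iterations):
        # Alternately rescale rows and columns to match the target marginals.
        u = [row_mass[i] / sum(kernel[i][j] * v[j] for j in range(n_cols))
             for i in range(n_rows)]
        v = [col_mass[j] / sum(kernel[i][j] * u[i] for i in range(n_rows))
             for j in range(n_cols)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(n_cols)] for i in range(n_rows)]


# Toy setup: 3 image regions (rows) and 2 candidate labels (columns).
# Cost stands for 1 - similarity; the numbers are made up.
cost = [
    [0.1, 0.9],  # region 0 resembles label 0
    [0.8, 0.2],  # region 1 resembles label 1
    [0.5, 0.5],  # region 2 is ambiguous
]
row_mass = [1 / 3, 1 / 3, 1 / 3]
col_mass = [0.5, 0.5]
plan = sinkhorn(cost, row_mass, col_mass)

for row in plan:
    print([round(value, 3) for value in row])
```

Unlike average pooling, which weights every region equally for every label, the transport plan assigns each region's mass mostly to the labels it matches well, which is the intuition behind replacing naive aggregation with a matching formulation.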

A Comprehensive Survey on Visual Concept Mining in Text-to-image Diffusion Models

Mar 17, 2025

Ziqiang Li, Jun Li, Lizhi Xiong, Zhangjie Fu, Zechao Li

Abstract:Text-to-image diffusion models have made significant advancements in generating high-quality, diverse images from text prompts. However, the inherent limitations of textual signals often prevent these models from fully capturing specific concepts, thereby reducing their controllability. To address this issue, several approaches have incorporated personalization techniques, utilizing reference images to mine visual concept representations that complement textual inputs and enhance the controllability of text-to-image diffusion models. Despite these advances, a comprehensive, systematic exploration of visual concept mining remains limited. In this paper, we categorize existing research into four key areas: Concept Learning, Concept Erasing, Concept Decomposition, and Concept Combination. This classification provides valuable insights into the foundational principles of Visual Concept Mining (VCM) techniques. Additionally, we identify key challenges and propose future research directions to propel this important and interesting field forward.

* Under review

Visual and Text Prompt Segmentation: A Novel Multi-Model Framework for Remote Sensing

Mar 10, 2025

Xing Zi, Kairui Jin, Xian Tao, Jun Li, Ali Braytee, Rajiv Ratn Shah, Mukesh Prasad

Abstract:Pixel-level segmentation is essential in remote sensing, where foundational vision models like CLIP and Segment Anything Model (SAM) have demonstrated significant capabilities in zero-shot segmentation tasks. Despite their advances, challenges specific to remote sensing remain substantial. First, without clear prompt constraints, SAM often generates redundant masks, making post-processing more complex. Second, the CLIP model, mainly designed for global feature alignment in foundational models, often overlooks local objects crucial to remote sensing. This oversight leads to inaccurate recognition or misplaced focus in multi-target remote sensing imagery. Third, neither model has been pre-trained on multi-scale aerial views, increasing the likelihood of detection failures. To tackle these challenges, we introduce the innovative VTPSeg pipeline, utilizing the strengths of Grounding DINO, CLIP, and SAM for enhanced open-vocabulary image segmentation. The Grounding DINO+ (GD+) module generates initial candidate bounding boxes, while the CLIP Filter++ (CLIP++) module uses a combination of visual and textual prompts to refine and filter out irrelevant object bounding boxes, ensuring that only pertinent objects are considered. Subsequently, these refined bounding boxes serve as specific prompts for the FastSAM model, which executes precise segmentation. Our VTPSeg is validated by experimental and ablation study results on five popular remote sensing image segmentation datasets.

* Under Review - IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing

DuCos: Duality Constrained Depth Super-Resolution via Foundation Model

Mar 06, 2025

Zhiqiang Yan, Zhengxue Wang, Haoye Dong, Jun Li, Jian Yang, Gim Hee Lee

Abstract:We introduce DuCos, a novel depth super-resolution framework grounded in Lagrangian duality theory, offering a flexible integration of multiple constraints and reconstruction objectives to enhance accuracy and robustness. Our DuCos is the first to significantly improve generalization across diverse scenarios with foundation models as prompts. The prompt design consists of two key components: Correlative Fusion (CF) and Gradient Regulation (GR). CF facilitates precise geometric alignment and effective fusion between prompt and depth features, while GR refines depth predictions by enforcing consistency with sharp-edged depth maps derived from foundation models. Crucially, these prompts are seamlessly embedded into the Lagrangian constraint term, forming a synergistic and principled framework. Extensive experiments demonstrate that DuCos outperforms existing state-of-the-art methods, achieving superior accuracy, robustness, and generalization. The source codes and pre-trained models will be publicly available.
