Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Licheng Jiao

DiffMOD: Progressive Diffusion Point Denoising for Moving Object Detection in Remote Sensing

Apr 14, 2025

Jinyue Zhang, Xiangrong Zhang, Zhongjian Huang, Tianyang Zhang, Yifei Jiang, Licheng Jiao

Figure 1 for DiffMOD: Progressive Diffusion Point Denoising for Moving Object Detection in Remote Sensing

Figure 2 for DiffMOD: Progressive Diffusion Point Denoising for Moving Object Detection in Remote Sensing

Figure 3 for DiffMOD: Progressive Diffusion Point Denoising for Moving Object Detection in Remote Sensing

Figure 4 for DiffMOD: Progressive Diffusion Point Denoising for Moving Object Detection in Remote Sensing

Abstract:Moving object detection (MOD) in remote sensing is significantly challenged by low resolution, extremely small object sizes, and complex noise interference. Current deep learning-based MOD methods rely on probability density estimation, which restricts flexible information interaction between objects and across temporal frames. To flexibly capture high-order inter-object and temporal relationships, we propose a point-based MOD in remote sensing. Inspired by diffusion models, the network optimization is formulated as a progressive denoising process that iteratively recovers moving object centers from sparse noisy points. Specifically, we sample scattered features from the backbone outputs as atomic units for subsequent processing, while global feature embeddings are aggregated to compensate for the limited coverage of sparse point features. By modeling spatial relative positions and semantic affinities, Spatial Relation Aggregation Attention is designed to enable high-order interactions among point-level features for enhanced object representation. To enhance temporal consistency, the Temporal Propagation and Global Fusion module is designed, which leverages an implicit memory reasoning mechanism for robust cross-frame feature integration. To align with the progressive denoising process, we propose a progressive MinK optimal transport assignment strategy that establishes specialized learning objectives at each denoising level. Additionally, we introduce a missing loss function to counteract the clustering tendency of denoised points around salient objects. Experiments on the RsData remote sensing MOD dataset show that our MOD method based on scattered point denoising can more effectively explore potential relationships between sparse moving objects and improve the detection capability and temporal consistency.

* 9 pages, 7 figures

Via

Access Paper or Ask Questions

STSeg-Complex Video Object Segmentation: The 1st Solution for 4th PVUW MOSE Challenge

Apr 11, 2025

Kehuan Song, Xinglin Xie, Kexin Zhang, Licheng Jiao, Lingling Li, Shuyuan Yang

Figure 1 for STSeg-Complex Video Object Segmentation: The 1st Solution for 4th PVUW MOSE Challenge

Figure 2 for STSeg-Complex Video Object Segmentation: The 1st Solution for 4th PVUW MOSE Challenge

Figure 3 for STSeg-Complex Video Object Segmentation: The 1st Solution for 4th PVUW MOSE Challenge

Figure 4 for STSeg-Complex Video Object Segmentation: The 1st Solution for 4th PVUW MOSE Challenge

Abstract:Segmentation of video objects in complex scenarios is highly challenging, and the MOSE dataset has significantly contributed to the development of this field. This technical report details the STSeg solution proposed by the "imaplus" team.By finetuning SAM2 and the unsupervised model TMO on the MOSE dataset, the STSeg solution demonstrates remarkable advantages in handling complex object motions and long-video sequences. In the inference phase, an Adaptive Pseudo-labels Guided Model Refinement Pipeline is adopted to intelligently select appropriate models for processing each video. Through finetuning the models and employing the Adaptive Pseudo-labels Guided Model Refinement Pipeline in the inference phase, the STSeg solution achieved a J&F score of 87.26% on the test set of the 2025 4th PVUW Challenge MOSE Track, securing the 1st place and advancing the technology for video object segmentation in complex scenarios.

Via

Access Paper or Ask Questions

Generalization-aware Remote Sensing Change Detection via Domain-agnostic Learning

Apr 01, 2025

Qi Zang, Shuang Wang, Dong Zhao, Dou Quan, Yang Hu, Licheng Jiao

Abstract:Change detection has essential significance for the region's development, in which pseudo-changes between bitemporal images induced by imaging environmental factors are key challenges. Existing transformation-based methods regard pseudo-changes as a kind of style shift and alleviate it by transforming bitemporal images into the same style using generative adversarial networks (GANs). However, their efforts are limited by two drawbacks: 1) Transformed images suffer from distortion that reduces feature discrimination. 2) Alignment hampers the model from learning domain-agnostic representations that degrades performance on scenes with domain shifts from the training data. Therefore, oriented from pseudo-changes caused by style differences, we present a generalizable domain-agnostic difference learning network (DonaNet). For the drawback 1), we argue for local-level statistics as style proxies to assist against domain shifts. For the drawback 2), DonaNet learns domain-agnostic representations by removing domain-specific style of encoded features and highlighting the class characteristics of objects. In the removal, we propose a domain difference removal module to reduce feature variance while preserving discriminative properties and propose its enhanced version to provide possibilities for eliminating more style by decorrelating the correlation between features. In the highlighting, we propose a cross-temporal generalization learning strategy to imitate latent domain shifts, thus enabling the model to extract feature representations more robust to shifts actively. Extensive experiments conducted on three public datasets demonstrate that DonaNet outperforms existing state-of-the-art methods with a smaller model size and is more robust to domain shift.

Via

Access Paper or Ask Questions

EgoSplat: Open-Vocabulary Egocentric Scene Understanding with Language Embedded 3D Gaussian Splatting

Mar 14, 2025

Di Li, Jie Feng, Jiahao Chen, Weisheng Dong, Guanbin Li, Guangming Shi, Licheng Jiao

Figure 1 for EgoSplat: Open-Vocabulary Egocentric Scene Understanding with Language Embedded 3D Gaussian Splatting

Figure 2 for EgoSplat: Open-Vocabulary Egocentric Scene Understanding with Language Embedded 3D Gaussian Splatting

Figure 3 for EgoSplat: Open-Vocabulary Egocentric Scene Understanding with Language Embedded 3D Gaussian Splatting

Figure 4 for EgoSplat: Open-Vocabulary Egocentric Scene Understanding with Language Embedded 3D Gaussian Splatting

Abstract:Egocentric scenes exhibit frequent occlusions, varied viewpoints, and dynamic interactions compared to typical scene understanding tasks. Occlusions and varied viewpoints can lead to multi-view semantic inconsistencies, while dynamic objects may act as transient distractors, introducing artifacts into semantic feature modeling. To address these challenges, we propose EgoSplat, a language-embedded 3D Gaussian Splatting framework for open-vocabulary egocentric scene understanding. A multi-view consistent instance feature aggregation method is designed to leverage the segmentation and tracking capabilities of SAM2 to selectively aggregate complementary features across views for each instance, ensuring precise semantic representation of scenes. Additionally, an instance-aware spatial-temporal transient prediction module is constructed to improve spatial integrity and temporal continuity in predictions by incorporating spatial-temporal associations across multi-view instances, effectively reducing artifacts in the semantic reconstruction of egocentric scenes. EgoSplat achieves state-of-the-art performance in both localization and segmentation tasks on two datasets, outperforming existing methods with a 8.2% improvement in localization accuracy and a 3.7% improvement in segmentation mIoU on the ADT dataset, and setting a new benchmark in open-vocabulary egocentric scene understanding. The code will be made publicly available.

Via

Access Paper or Ask Questions

DeProPose: Deficiency-Proof 3D Human Pose Estimation via Adaptive Multi-View Fusion

Feb 23, 2025

Jianbin Jiao, Xina Cheng, Kailun Yang, Xiangrong Zhang, Licheng Jiao

Figure 1 for DeProPose: Deficiency-Proof 3D Human Pose Estimation via Adaptive Multi-View Fusion

Figure 2 for DeProPose: Deficiency-Proof 3D Human Pose Estimation via Adaptive Multi-View Fusion

Figure 3 for DeProPose: Deficiency-Proof 3D Human Pose Estimation via Adaptive Multi-View Fusion

Figure 4 for DeProPose: Deficiency-Proof 3D Human Pose Estimation via Adaptive Multi-View Fusion

Abstract:3D human pose estimation has wide applications in fields such as intelligent surveillance, motion capture, and virtual reality. However, in real-world scenarios, issues such as occlusion, noise interference, and missing viewpoints can severely affect pose estimation. To address these challenges, we introduce the task of Deficiency-Aware 3D Pose Estimation. Traditional 3D pose estimation methods often rely on multi-stage networks and modular combinations, which can lead to cumulative errors and increased training complexity, making them unable to effectively address deficiency-aware estimation. To this end, we propose DeProPose, a flexible method that simplifies the network architecture to reduce training complexity and avoid information loss in multi-stage designs. Additionally, the model innovatively introduces a multi-view feature fusion mechanism based on relative projection error, which effectively utilizes information from multiple viewpoints and dynamically assigns weights, enabling efficient integration and enhanced robustness to overcome deficiency-aware 3D Pose Estimation challenges. Furthermore, to thoroughly evaluate this end-to-end multi-view 3D human pose estimation model and to advance research on occlusion-related challenges, we have developed a novel 3D human pose estimation dataset, termed the Deficiency-Aware 3D Pose Estimation (DA-3DPE) dataset. This dataset encompasses a wide range of deficiency scenarios, including noise interference, missing viewpoints, and occlusion challenges. Compared to state-of-the-art methods, DeProPose not only excels in addressing the deficiency-aware problem but also shows improvement in conventional scenarios, providing a powerful and user-friendly solution for 3D human pose estimation. The source code will be available at https://github.com/WUJINHUAN/DeProPose.

* The source code will be available at https://github.com/WUJINHUAN/DeProPose

Via

Access Paper or Ask Questions

Learning Evolution via Optimization Knowledge Adaptation

Jan 04, 2025

Chao Wang, Licheng Jiao, Jiaxuan Zhao, Lingling Li, Fang Liu, Shuyuan Yang

Figure 1 for Learning Evolution via Optimization Knowledge Adaptation

Figure 2 for Learning Evolution via Optimization Knowledge Adaptation

Figure 3 for Learning Evolution via Optimization Knowledge Adaptation

Figure 4 for Learning Evolution via Optimization Knowledge Adaptation

Abstract:Evolutionary algorithms (EAs) maintain populations through evolutionary operators to discover diverse solutions for complex tasks while gathering valuable knowledge, such as historical population data and fitness evaluations. However, traditional EAs face challenges in dynamically adapting to expanding knowledge bases, hindering the efficient exploitation of accumulated information and limiting adaptability to new situations. To address these issues, we introduce an Optimization Knowledge Adaptation Evolutionary Model (OKAEM), which features dynamic parameter adjustment using accumulated knowledge to enhance its optimization capabilities. OKAEM employs attention mechanisms to model the interactions among individuals, fitness landscapes, and genetic components separately, thereby parameterizing the evolutionary operators of selection, crossover, and mutation. These powerful learnable operators enable OKAEM to benefit from pre-learned extensive prior knowledge and self-tune with real-time evolutionary insights. Experimental results demonstrate that OKAEM: 1) exploits prior knowledge for significant performance gains across various knowledge transfer settings; 2) achieves competitive performance through self-tuning alone, even without prior knowledge; 3) outperforms state-of-the-art black-box baselines in a vision-language model tuning case; 4) can improve its optimization capabilities with growing knowledge; 5) is capable of emulating principles of natural selection and genetic recombination.

* This work has been submitted to Springer Nature for possible publication

Via

Access Paper or Ask Questions

Learnable Evolutionary Multi-Objective Combinatorial Optimization via Sequence-to-Sequence Model

Dec 09, 2024

Jiaxiang Huang, Licheng Jiao

Abstract:Recent advances in learnable evolutionary algorithms have demonstrated the importance of leveraging population distribution information and historical evolutionary trajectories. While significant progress has been made in continuous optimization domains, combinatorial optimization problems remain challenging due to their discrete nature and complex solution spaces. To address this gap, we propose SeqMO, a novel learnable multi-objective combinatorial optimization method that integrates sequence-to-sequence models with evolutionary algorithms. Our approach divides approximate Pareto solution sets based on their objective values' distance to the Pareto front, and establishes mapping relationships between solutions by minimizing objective vector angles in the target space. This mapping creates structured training data for pointer networks, which learns to predict promising solution trajectories in the discrete search space. The trained model then guides the evolutionary process by generating new candidate solutions while maintaining population diversity. Experiments on the multi-objective travel salesman problem and the multi-objective quadratic assignment problem verify the effectiveness of the algorithm. Our code is available at: \href{https://github.com/jiaxianghuang/SeqMO}{https://github.com/jiaxianghuang/SeqMO}.

Via

Access Paper or Ask Questions

Knowledge-aware Evolutionary Graph Neural Architecture Search

Nov 26, 2024

Chao Wang, Jiaxuan Zhao, Lingling Li, Licheng Jiao, Fang Liu, Xu Liu, Shuyuan Yang

Figure 1 for Knowledge-aware Evolutionary Graph Neural Architecture Search

Figure 2 for Knowledge-aware Evolutionary Graph Neural Architecture Search

Figure 3 for Knowledge-aware Evolutionary Graph Neural Architecture Search

Figure 4 for Knowledge-aware Evolutionary Graph Neural Architecture Search

Abstract:Graph neural architecture search (GNAS) can customize high-performance graph neural network architectures for specific graph tasks or datasets. However, existing GNAS methods begin searching for architectures from a zero-knowledge state, ignoring the prior knowledge that may improve the search efficiency. The available knowledge base (e.g. NAS-Bench-Graph) contains many rich architectures and their multiple performance metrics, such as the accuracy (#Acc) and number of parameters (#Params). This study proposes exploiting such prior knowledge to accelerate the multi-objective evolutionary search on a new graph dataset, named knowledge-aware evolutionary GNAS (KEGNAS). KEGNAS employs the knowledge base to train a knowledge model and a deep multi-output Gaussian process (DMOGP) in one go, which generates and evaluates transfer architectures in only a few GPU seconds. The knowledge model first establishes a dataset-to-architecture mapping, which can quickly generate candidate transfer architectures for a new dataset. Subsequently, the DMOGP with architecture and dataset encodings is designed to predict multiple performance metrics for candidate transfer architectures on the new dataset. According to the predicted metrics, non-dominated candidate transfer architectures are selected to warm-start the multi-objective evolutionary algorithm for optimizing the #Acc and #Params on a new dataset. Empirical studies on NAS-Bench-Graph and five real-world datasets show that KEGNAS swiftly generates top-performance architectures, achieving 4.27% higher accuracy than advanced evolutionary baselines and 11.54% higher accuracy than advanced differentiable baselines. In addition, ablation studies demonstrate that the use of prior knowledge significantly improves the search performance.

* This work has been accepted by Knowledge-Based Systems

Via

Access Paper or Ask Questions

Masked Image Modeling Boosting Semi-Supervised Semantic Segmentation

Nov 14, 2024

Yangyang Li, Xuanting Hao, Ronghua Shang, Licheng Jiao

Figure 1 for Masked Image Modeling Boosting Semi-Supervised Semantic Segmentation

Figure 2 for Masked Image Modeling Boosting Semi-Supervised Semantic Segmentation

Figure 3 for Masked Image Modeling Boosting Semi-Supervised Semantic Segmentation

Figure 4 for Masked Image Modeling Boosting Semi-Supervised Semantic Segmentation

Abstract:In view of the fact that semi- and self-supervised learning share a fundamental principle, effectively modeling knowledge from unlabeled data, various semi-supervised semantic segmentation methods have integrated representative self-supervised learning paradigms for further regularization. However, the potential of the state-of-the-art generative self-supervised paradigm, masked image modeling, has been scarcely studied. This paradigm learns the knowledge through establishing connections between the masked and visible parts of masked image, during the pixel reconstruction process. By inheriting and extending this insight, we successfully leverage masked image modeling to boost semi-supervised semantic segmentation. Specifically, we introduce a novel class-wise masked image modeling that independently reconstructs different image regions according to their respective classes. In this way, the mask-induced connections are established within each class, mitigating the semantic confusion that arises from plainly reconstructing images in basic masked image modeling. To strengthen these intra-class connections, we further develop a feature aggregation strategy that minimizes the distances between features corresponding to the masked and visible parts within the same class. Additionally, in semantic space, we explore the application of masked image modeling to enhance regularization. Extensive experiments conducted on well-known benchmarks demonstrate that our approach achieves state-of-the-art performance. The code will be available at https://github.com/haoxt/S4MIM.

* 13 pages. This work has been submitted to the IEEE for possible publication

Via

Access Paper or Ask Questions

Improving the Multi-label Atomic Activity Recognition by Robust Visual Feature and Advanced Attention @ ROAD++ Atomic Activity Recognition 2024

Oct 21, 2024

Jiamin Cao, Lingqi Wang, Kexin Zhang, Yuting Yang, Licheng Jiao, Yuwei Guo

Figure 1 for Improving the Multi-label Atomic Activity Recognition by Robust Visual Feature and Advanced Attention @ ROAD++ Atomic Activity Recognition 2024

Abstract:Road++ Track3 proposes a multi-label atomic activity recognition task in traffic scenarios, which can be standardized as a 64-class multi-label video action recognition task. In the multi-label atomic activity recognition task, the robustness of visual feature extraction remains a key challenge, which directly affects the model performance and generalization ability. To cope with these issues, our team optimized three aspects: data processing, model and post-processing. Firstly, the appropriate resolution and video sampling strategy are selected, and a fixed sampling strategy is set on the validation and test sets. Secondly, in terms of model training, the team selects a variety of visual backbone networks for feature extraction, and then introduces the action-slot model, which is trained on the training and validation sets, and reasoned on the test set. Finally, for post-processing, the team combined the strengths and weaknesses of different models for weighted fusion, and the final mAP on the test set was 58%, which is 4% higher than the challenge baseline.

Via

Access Paper or Ask Questions