Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jin Zheng

VMDNet: Time Series Forecasting with Leakage-Free Samplewise Variational Mode Decomposition and Multibranch Decoding

Sep 18, 2025

Weibin Feng, Ran Tao, John Cartlidge, Jin Zheng

Abstract:In time series forecasting, capturing recurrent temporal patterns is essential; decomposition techniques make such structure explicit and thereby improve predictive performance. Variational Mode Decomposition (VMD) is a powerful signal-processing method for periodicity-aware decomposition and has seen growing adoption in recent years. However, existing studies often suffer from information leakage and rely on inappropriate hyperparameter tuning. To address these issues, we propose VMDNet, a causality-preserving framework that (i) applies sample-wise VMD to avoid leakage; (ii) represents each decomposed mode with frequency-aware embeddings and decodes it using parallel temporal convolutional networks (TCNs), ensuring mode independence and efficient learning; and (iii) introduces a bilevel, Stackelberg-inspired optimisation to adaptively select VMD's two core hyperparameters: the number of modes (K) and the bandwidth penalty (alpha). Experiments on two energy-related datasets demonstrate that VMDNet achieves state-of-the-art results when periodicity is strong, showing clear advantages in capturing structured periodic patterns while remaining robust under weak periodicity.

* 5 pages, 1 figure, 2 tables

Via

Access Paper or Ask Questions

Occlusion-Aware 3D Hand-Object Pose Estimation with Masked AutoEncoders

Jun 12, 2025

Hui Yang, Wei Sun, Jian Liu, Jin Zheng, Jian Xiao, Ajmal Mian

Abstract:Hand-object pose estimation from monocular RGB images remains a significant challenge mainly due to the severe occlusions inherent in hand-object interactions. Existing methods do not sufficiently explore global structural perception and reasoning, which limits their effectiveness in handling occluded hand-object interactions. To address this challenge, we propose an occlusion-aware hand-object pose estimation method based on masked autoencoders, termed as HOMAE. Specifically, we propose a target-focused masking strategy that imposes structured occlusion on regions of hand-object interaction, encouraging the model to learn context-aware features and reason about the occluded structures. We further integrate multi-scale features extracted from the decoder to predict a signed distance field (SDF), capturing both global context and fine-grained geometry. To enhance geometric perception, we combine the implicit SDF with an explicit point cloud derived from the SDF, leveraging the complementary strengths of both representations. This fusion enables more robust handling of occluded regions by combining the global context from the SDF with the precise local geometry provided by the point cloud. Extensive experiments on challenging DexYCB and HO3Dv2 benchmarks demonstrate that HOMAE achieves state-of-the-art performance in hand-object pose estimation. We will release our code and model.

* 10 pages, 6 figures

Via

Access Paper or Ask Questions

MonoDiff9D: Monocular Category-Level 9D Object Pose Estimation via Diffusion Model

Apr 14, 2025

Jian Liu, Wei Sun, Hui Yang, Jin Zheng, Zichen Geng, Hossein Rahmani, Ajmal Mian

Figure 1 for MonoDiff9D: Monocular Category-Level 9D Object Pose Estimation via Diffusion Model

Figure 2 for MonoDiff9D: Monocular Category-Level 9D Object Pose Estimation via Diffusion Model

Figure 3 for MonoDiff9D: Monocular Category-Level 9D Object Pose Estimation via Diffusion Model

Figure 4 for MonoDiff9D: Monocular Category-Level 9D Object Pose Estimation via Diffusion Model

Abstract:Object pose estimation is a core means for robots to understand and interact with their environment. For this task, monocular category-level methods are attractive as they require only a single RGB camera. However, current methods rely on shape priors or CAD models of the intra-class known objects. We propose a diffusion-based monocular category-level 9D object pose generation method, MonoDiff9D. Our motivation is to leverage the probabilistic nature of diffusion models to alleviate the need for shape priors, CAD models, or depth sensors for intra-class unknown object pose estimation. We first estimate coarse depth via DINOv2 from the monocular image in a zero-shot manner and convert it into a point cloud. We then fuse the global features of the point cloud with the input image and use the fused features along with the encoded time step to condition MonoDiff9D. Finally, we design a transformer-based denoiser to recover the object pose from Gaussian noise. Extensive experiments on two popular benchmark datasets show that MonoDiff9D achieves state-of-the-art monocular category-level 9D object pose estimation accuracy without the need for shape priors or CAD models at any stage. Our code will be made public at https://github.com/CNJianLiu/MonoDiff9D.

* Accepted by ICRA'25

Via

Access Paper or Ask Questions

Novel Object 6D Pose Estimation with a Single Reference View

Mar 07, 2025

Jian Liu, Wei Sun, Kai Zeng, Jin Zheng, Hui Yang, Lin Wang, Hossein Rahmani, Ajmal Mian

Figure 1 for Novel Object 6D Pose Estimation with a Single Reference View

Figure 2 for Novel Object 6D Pose Estimation with a Single Reference View

Figure 3 for Novel Object 6D Pose Estimation with a Single Reference View

Figure 4 for Novel Object 6D Pose Estimation with a Single Reference View

Abstract:Existing novel object 6D pose estimation methods typically rely on CAD models or dense reference views, which are both difficult to acquire. Using only a single reference view is more scalable, but challenging due to large pose discrepancies and limited geometric and spatial information. To address these issues, we propose a Single-Reference-based novel object 6D (SinRef-6D) pose estimation method. Our key idea is to iteratively establish point-wise alignment in the camera coordinate system based on state space models (SSMs). Specifically, iterative camera-space point-wise alignment can effectively handle large pose discrepancies, while our proposed RGB and Points SSMs can capture long-range dependencies and spatial information from a single view, offering linear complexity and superior spatial modeling capability. Once pre-trained on synthetic data, SinRef-6D can estimate the 6D pose of a novel object using only a single reference view, without requiring retraining or a CAD model. Extensive experiments on six popular datasets and real-world robotic scenes demonstrate that we achieve on-par performance with CAD-based and dense reference view-based methods, despite operating in the more challenging single reference setting. Code will be released at https://github.com/CNJianLiu/SinRef-6D.

* 17 pages, 12 figures (including supplementary material)

Via

Access Paper or Ask Questions

InsTaG: Learning Personalized 3D Talking Head from Few-Second Video

Feb 27, 2025

Jiahe Li, Jiawei Zhang, Xiao Bai, Jin Zheng, Jun Zhou, Lin Gu

Figure 1 for InsTaG: Learning Personalized 3D Talking Head from Few-Second Video

Figure 2 for InsTaG: Learning Personalized 3D Talking Head from Few-Second Video

Figure 3 for InsTaG: Learning Personalized 3D Talking Head from Few-Second Video

Figure 4 for InsTaG: Learning Personalized 3D Talking Head from Few-Second Video

Abstract:Despite exhibiting impressive performance in synthesizing lifelike personalized 3D talking heads, prevailing methods based on radiance fields suffer from high demands for training data and time for each new identity. This paper introduces InsTaG, a 3D talking head synthesis framework that allows a fast learning of realistic personalized 3D talking head from few training data. Built upon a lightweight 3DGS person-specific synthesizer with universal motion priors, InsTaG achieves high-quality and fast adaptation while preserving high-level personalization and efficiency. As preparation, we first propose an Identity-Free Pre-training strategy that enables the pre-training of the person-specific model and encourages the collection of universal motion priors from long-video data corpus. To fully exploit the universal motion priors to learn an unseen new identity, we then present a Motion-Aligned Adaptation strategy to adaptively align the target head to the pre-trained field, and constrain a robust dynamic head structure under few training data. Experiments demonstrate our outstanding performance and efficiency under various data scenarios to render high-quality personalized talking heads.

* Accepted at CVPR 2025. Project page: https://fictionarry.github.io/InsTaG/

Via

Access Paper or Ask Questions

Dynamic Graph Representation with Contrastive Learning for Financial Market Prediction: Integrating Temporal Evolution and Static Relations

Dec 05, 2024

Yunhua Pei, Jin Zheng, John Cartlidge

Figure 1 for Dynamic Graph Representation with Contrastive Learning for Financial Market Prediction: Integrating Temporal Evolution and Static Relations

Figure 2 for Dynamic Graph Representation with Contrastive Learning for Financial Market Prediction: Integrating Temporal Evolution and Static Relations

Figure 3 for Dynamic Graph Representation with Contrastive Learning for Financial Market Prediction: Integrating Temporal Evolution and Static Relations

Figure 4 for Dynamic Graph Representation with Contrastive Learning for Financial Market Prediction: Integrating Temporal Evolution and Static Relations

Abstract:Temporal Graph Learning (TGL) is crucial for capturing the evolving nature of stock markets. Traditional methods often ignore the interplay between dynamic temporal changes and static relational structures between stocks. To address this issue, we propose the Dynamic Graph Representation with Contrastive Learning (DGRCL) framework, which integrates dynamic and static graph relations to improve the accuracy of stock trend prediction. Our framework introduces two key components: the Embedding Enhancement (EE) module and the Contrastive Constrained Training (CCT) module. The EE module focuses on dynamically capturing the temporal evolution of stock data, while the CCT module enforces static constraints based on stock relations, refined within contrastive learning. This dual-relation approach allows for a more comprehensive understanding of stock market dynamics. Our experiments on two major U.S. stock market datasets, NASDAQ and NYSE, demonstrate that DGRCL significantly outperforms state-of-the-art TGL baselines. Ablation studies indicate the importance of both modules. Overall, DGRCL not only enhances prediction ability but also provides a robust framework for integrating temporal and relational data in dynamic graphs. Code and data are available for public access.

* 12 pages, 2 figures, author manuscript accepted for ICAART 2025 (International Conference on Agents and Artificial Intelligence)

Via

Access Paper or Ask Questions

Long-Tailed Out-of-Distribution Detection via Normalized Outlier Distribution Adaptation

Oct 28, 2024

Wenjun Miao, Guansong Pang, Jin Zheng, Xiao Bai

Abstract:One key challenge in Out-of-Distribution (OOD) detection is the absence of ground-truth OOD samples during training. One principled approach to address this issue is to use samples from external datasets as outliers (i.e., pseudo OOD samples) to train OOD detectors. However, we find empirically that the outlier samples often present a distribution shift compared to the true OOD samples, especially in Long-Tailed Recognition (LTR) scenarios, where ID classes are heavily imbalanced, \ie, the true OOD samples exhibit very different probability distribution to the head and tailed ID classes from the outliers. In this work, we propose a novel approach, namely normalized outlier distribution adaptation (AdaptOD), to tackle this distribution shift problem. One of its key components is dynamic outlier distribution adaptation that effectively adapts a vanilla outlier distribution based on the outlier samples to the true OOD distribution by utilizing the OOD knowledge in the predicted OOD samples during inference. Further, to obtain a more reliable set of predicted OOD samples on long-tailed ID data, a novel dual-normalized energy loss is introduced in AdaptOD, which leverages class- and sample-wise normalized energy to enforce a more balanced prediction energy on imbalanced ID samples. This helps avoid bias toward the head samples and learn a substantially better vanilla outlier distribution than existing energy losses during training. It also eliminates the need of manually tuning the sensitive margin hyperparameters in energy losses. Empirical results on three popular benchmarks for OOD detection in LTR show the superior performance of AdaptOD over state-of-the-art methods. Code is available at \url{https://github.com/mala-lab/AdaptOD}.

* NIPS2024

Via

Access Paper or Ask Questions

Prompting Continual Person Search

Oct 25, 2024

Pengcheng Zhang, Xiaohan Yu, Xiao Bai, Jin Zheng, Xin Ning

Figure 1 for Prompting Continual Person Search

Figure 2 for Prompting Continual Person Search

Figure 3 for Prompting Continual Person Search

Figure 4 for Prompting Continual Person Search

Abstract:The development of person search techniques has been greatly promoted in recent years for its superior practicality and challenging goals. Despite their significant progress, existing person search models still lack the ability to continually learn from increaseing real-world data and adaptively process input from different domains. To this end, this work introduces the continual person search task that sequentially learns on multiple domains and then performs person search on all seen domains. This requires balancing the stability and plasticity of the model to continually learn new knowledge without catastrophic forgetting. For this, we propose a Prompt-based Continual Person Search (PoPS) model in this paper. First, we design a compositional person search transformer to construct an effective pre-trained transformer without exhaustive pre-training from scratch on large-scale person search data. This serves as the fundamental for prompt-based continual learning. On top of that, we design a domain incremental prompt pool with a diverse attribute matching module. For each domain, we independently learn a set of prompts to encode the domain-oriented knowledge. Meanwhile, we jointly learn a group of diverse attribute projections and prototype embeddings to capture discriminative domain attributes. By matching an input image with the learned attributes across domains, the learned prompts can be properly selected for model inference. Extensive experiments are conducted to validate the proposed method for continual person search. The source code is available at https://github.com/PatrickZad/PoPS.

* ACM MM 2024

Via

Access Paper or Ask Questions

OpenCIL: Benchmarking Out-of-Distribution Detection in Class-Incremental Learning

Jul 09, 2024

Wenjun Miao, Guansong Pang, Trong-Tung Nguyen, Ruohang Fang, Jin Zheng, Xiao Bai

Figure 1 for OpenCIL: Benchmarking Out-of-Distribution Detection in Class-Incremental Learning

Figure 2 for OpenCIL: Benchmarking Out-of-Distribution Detection in Class-Incremental Learning

Figure 3 for OpenCIL: Benchmarking Out-of-Distribution Detection in Class-Incremental Learning

Figure 4 for OpenCIL: Benchmarking Out-of-Distribution Detection in Class-Incremental Learning

Abstract:Class incremental learning (CIL) aims to learn a model that can not only incrementally accommodate new classes, but also maintain the learned knowledge of old classes. Out-of-distribution (OOD) detection in CIL is to retain this incremental learning ability, while being able to reject unknown samples that are drawn from different distributions of the learned classes. This capability is crucial to the safety of deploying CIL models in open worlds. However, despite remarkable advancements in the respective CIL and OOD detection, there lacks a systematic and large-scale benchmark to assess the capability of advanced CIL models in detecting OOD samples. To fill this gap, in this study we design a comprehensive empirical study to establish such a benchmark, named $\textbf{OpenCIL}$. To this end, we propose two principled frameworks for enabling four representative CIL models with 15 diverse OOD detection methods, resulting in 60 baseline models for OOD detection in CIL. The empirical evaluation is performed on two popular CIL datasets with six commonly-used OOD datasets. One key observation we find through our comprehensive evaluation is that the CIL models can be severely biased towards the OOD samples and newly added classes when they are exposed to open environments. Motivated by this, we further propose a new baseline for OOD detection in CIL, namely Bi-directional Energy Regularization ($\textbf{BER}$), which is specially designed to mitigate these two biases in different CIL models by having energy regularization on both old and new classes. Its superior performance is justified in our experiments. All codes and datasets are open-source at https://github.com/mala-lab/OpenCIL.

Via

Access Paper or Ask Questions

CoR-GS: Sparse-View 3D Gaussian Splatting via Co-Regularization

May 20, 2024

Jiawei Zhang, Jiahe Li, Xiaohan Yu, Lei Huang, Lin Gu, Jin Zheng, Xiao Bai

Figure 1 for CoR-GS: Sparse-View 3D Gaussian Splatting via Co-Regularization

Figure 2 for CoR-GS: Sparse-View 3D Gaussian Splatting via Co-Regularization

Figure 3 for CoR-GS: Sparse-View 3D Gaussian Splatting via Co-Regularization

Figure 4 for CoR-GS: Sparse-View 3D Gaussian Splatting via Co-Regularization

Abstract:3D Gaussian Splatting (3DGS) creates a radiance field consisting of 3D Gaussians to represent a scene. With sparse training views, 3DGS easily suffers from overfitting, negatively impacting the reconstruction quality. This paper introduces a new co-regularization perspective for improving sparse-view 3DGS. When training two 3D Gaussian radiance fields with the same sparse views of a scene, we observe that the two radiance fields exhibit \textit{point disagreement} and \textit{rendering disagreement} that can unsupervisedly predict reconstruction quality, stemming from the sampling implementation in densification. We further quantify the point disagreement and rendering disagreement by evaluating the registration between Gaussians' point representations and calculating differences in their rendered pixels. The empirical study demonstrates the negative correlation between the two disagreements and accurate reconstruction, which allows us to identify inaccurate reconstruction without accessing ground-truth information. Based on the study, we propose CoR-GS, which identifies and suppresses inaccurate reconstruction based on the two disagreements: (\romannumeral1) Co-pruning considers Gaussians that exhibit high point disagreement in inaccurate positions and prunes them. (\romannumeral2) Pseudo-view co-regularization considers pixels that exhibit high rendering disagreement are inaccurately rendered and suppress the disagreement. Results on LLFF, Mip-NeRF360, DTU, and Blender demonstrate that CoR-GS effectively regularizes the scene geometry, reconstructs the compact representations, and achieves state-of-the-art novel view synthesis quality under sparse training views.

* Project page: https://jiaw-z.github.io/CoR-GS/

Via

Access Paper or Ask Questions