Li Zhang

Shammie

DG-SLAM: Robust Dynamic Gaussian Splatting SLAM with Hybrid Pose Optimization

Nov 13, 2024

Yueming Xu, Haochen Jiang, Zhongyang Xiao, Jianfeng Feng, Li Zhang

Abstract: Achieving robust and precise pose estimation in dynamic scenes is a significant research challenge in Visual Simultaneous Localization and Mapping (SLAM). Recent advancements integrating Gaussian Splatting into SLAM systems have proven effective in creating high-quality renderings using explicit 3D Gaussian models, significantly improving environmental reconstruction fidelity. However, these approaches depend on a static environment assumption and face challenges in dynamic environments due to inconsistent observations of geometry and photometry. To address this problem, we propose DG-SLAM, the first robust dynamic visual SLAM system grounded in 3D Gaussians, which provides precise camera pose estimation alongside high-fidelity reconstructions. Specifically, we propose effective strategies, including motion mask generation, adaptive Gaussian point management, and a hybrid camera tracking algorithm to improve the accuracy and robustness of pose estimation. Extensive experiments demonstrate that DG-SLAM delivers state-of-the-art performance in camera pose estimation, map reconstruction, and novel-view synthesis in dynamic scenes, outperforming existing methods while preserving real-time rendering ability.

WassFFed: Wasserstein Fair Federated Learning

Nov 11, 2024

Zhongxuan Han, Li Zhang, Chaochao Chen, Xiaolin Zheng, Fei Zheng, Yuyuan Li, Jianwei Yin

Abstract: Federated Learning (FL) employs a training approach to address scenarios where users' data cannot be shared across clients. Achieving fairness in FL is imperative since training data in FL is inherently geographically distributed among diverse user groups. Existing research on fairness predominantly assumes access to the entire training data, making direct transfer to FL challenging. However, the limited existing research on fairness in FL does not effectively address two key challenges: (CH1) current methods fail to deal with the inconsistency between fair optimization results obtained with surrogate functions and fair classification results, and (CH2) directly aggregating local fair models does not always yield a globally fair model due to non-independent and identically distributed (non-IID) data among clients. To address these challenges, we propose a Wasserstein Fair Federated Learning framework, namely WassFFed. To tackle CH1, we ensure that the outputs of local models, rather than the loss calculated with surrogate functions or classification results with a threshold, remain independent of various user groups. To resolve CH2, we employ a Wasserstein barycenter calculation of all local models' outputs for each user group, bringing local model outputs closer to the global output distribution to ensure consistency between the global model and local models. We conduct extensive experiments on three real-world datasets, demonstrating that WassFFed outperforms existing approaches in striking a balance between accuracy and fairness.

* Submitted to TKDE
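
The barycenter step mentioned in the abstract can be illustrated with a minimal sketch. It assumes each client reports scalar model outputs (for example, predicted probabilities) for one user group, and it uses the standard one-dimensional result that the 2-Wasserstein barycenter is obtained by averaging quantile functions. The function names and the quantile grid are illustrative choices, not taken from the paper, and the paper's actual computation may differ.

```python
def empirical_quantiles(samples, n_quantiles):
    """Evaluate the empirical quantile function on an evenly spaced grid."""
    xs = sorted(samples)
    out = []
    for k in range(n_quantiles):
        q = (k + 0.5) / n_quantiles          # midpoint of the k-th quantile bin
        pos = q * (len(xs) - 1)              # fractional index into sorted samples
        lo = int(pos)
        hi = min(lo + 1, len(xs) - 1)
        frac = pos - lo
        out.append(xs[lo] * (1 - frac) + xs[hi] * frac)  # linear interpolation
    return out


def barycenter_1d(client_outputs, weights=None, n_quantiles=100):
    """1D Wasserstein-2 barycenter: weighted average of client quantile functions.

    client_outputs: list of lists, one list of scalar outputs per client
                    (hypothetical input format for this sketch).
    weights:        optional per-client weights; normalized to sum to 1.
    """
    if weights is None:
        weights = [1.0] * len(client_outputs)
    total = sum(weights)
    weights = [w / total for w in weights]
    quantile_sets = [empirical_quantiles(s, n_quantiles) for s in client_outputs]
    return [
        sum(w * qs[k] for w, qs in zip(weights, quantile_sets))
        for k in range(n_quantiles)
    ]


# Two clients whose outputs for the same user group are shifted apart.
global_quantiles = barycenter_1d([[0.0, 1.0, 2.0], [10.0, 11.0, 12.0]], n_quantiles=5)
```

The returned list is the quantile function of a distribution that sits "between" the clients' output distributions; pulling each local output distribution toward it is one way to read the consistency goal described above.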

NSSI-Net: Multi-Concept Generative Adversarial Network for Non-Suicidal Self-Injury Detection Using High-Dimensional EEG Signals in a Semi-Supervised Learning Framework

Oct 16, 2024

Zhen Liang, Weishan Ye, Qile Liu, Li Zhang, Gan Huang, Yongjie Zhou

Abstract: Non-suicidal self-injury (NSSI) is a serious threat to the physical and mental health of adolescents, significantly increasing the risk of suicide and attracting widespread public concern. Electroencephalography (EEG), as an objective tool for identifying brain disorders, holds great promise. However, extracting meaningful and reliable features from high-dimensional EEG data, especially by integrating spatiotemporal brain dynamics into informative representations, remains a major challenge. In this study, we introduce an advanced semi-supervised adversarial network, NSSI-Net, to effectively model EEG features related to NSSI. NSSI-Net consists of two key modules: a spatial-temporal feature extraction module and a multi-concept discriminator. In the spatial-temporal feature extraction module, an integrated 2D convolutional neural network (2D-CNN) and a bi-directional Gated Recurrent Unit (BiGRU) are used to capture both spatial and temporal dynamics in EEG data. In the multi-concept discriminator, signal, gender, domain, and disease levels are fully explored to extract meaningful EEG features, considering individual, demographic, and disease variations across a diverse population. Based on self-collected NSSI data (n=114), the model's effectiveness and reliability are demonstrated, with a 7.44% improvement in performance compared to existing machine learning and deep learning methods. This study advances the understanding and early diagnosis of NSSI in adolescents with depression, enabling timely intervention. The source code is available at https://github.com/Vesan-yws/NSSINet.

Meta-DT: Offline Meta-RL as Conditional Sequence Modeling with World Model Disentanglement

Oct 15, 2024

Zhi Wang, Li Zhang, Wenhao Wu, Yuanheng Zhu, Dongbin Zhao, Chunlin Chen

Abstract: A longstanding goal of artificial general intelligence is highly capable generalists that can learn from diverse experiences and generalize to unseen tasks. The language and vision communities have seen remarkable progress toward this trend by scaling up transformer-based models trained on massive datasets, while reinforcement learning (RL) agents still suffer from poor generalization capacity under such paradigms. To tackle this challenge, we propose Meta Decision Transformer (Meta-DT), which leverages the sequential modeling ability of the transformer architecture and robust task representation learning via world model disentanglement to achieve efficient generalization in offline meta-RL. We pretrain a context-aware world model to learn a compact task representation, and inject it as a contextual condition to the causal transformer to guide task-oriented sequence generation. Then, we subtly utilize history trajectories generated by the meta-policy as a self-guided prompt to exploit the architectural inductive bias. We select the trajectory segment that yields the largest prediction error on the pretrained world model to construct the prompt, aiming to maximally encode task-specific information complementary to the world model. Notably, the proposed framework eliminates the requirement of any expert demonstration or domain knowledge at test time. Experimental results on MuJoCo and Meta-World benchmarks across various dataset types show that Meta-DT exhibits superior few- and zero-shot generalization capacity compared to strong baselines while being more practical with fewer prerequisites. Our code is available at https://github.com/NJU-RL/Meta-DT.

* NeurIPS 2024
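
The prompt-selection rule in the abstract (pick the trajectory segment with the largest world-model prediction error) can be sketched as a sliding-window search. This is a minimal illustration under assumptions: the transition format, the squared-error measure, and the `world_model(state, action)` callable are hypothetical stand-ins, not the interface of the released code.

```python
def segment_with_largest_error(trajectory, world_model, segment_len):
    """Return (start_index, segment) for the window with the largest total error.

    trajectory:  list of (state, action, reward, next_state) tuples, where
                 states are tuples of floats (assumed format for this sketch).
    world_model: callable (state, action) -> (predicted_next_state, predicted_reward);
                 a hypothetical stand-in for a pretrained world model.
    """
    if segment_len <= 0 or segment_len > len(trajectory):
        raise ValueError("segment_len must be between 1 and len(trajectory)")

    # Per-step squared prediction error on next state and reward.
    errors = []
    for state, action, reward, next_state in trajectory:
        pred_next, pred_reward = world_model(state, action)
        state_err = sum((p - t) ** 2 for p, t in zip(pred_next, next_state))
        errors.append(state_err + (pred_reward - reward) ** 2)

    # Sliding-window sum over consecutive steps.
    window = sum(errors[:segment_len])
    best_start, best_error = 0, window
    for start in range(1, len(errors) - segment_len + 1):
        window += errors[start + segment_len - 1] - errors[start - 1]
        if window > best_error:
            best_start, best_error = start, window
    return best_start, trajectory[best_start:best_start + segment_len]


# Toy world model that predicts "nothing changes and reward is zero".
def toy_model(state, action):
    return state, 0.0

toy_trajectory = [
    ((0.0,), 0, 0.0, (0.0,)),
    ((0.0,), 0, 0.0, (0.0,)),
    ((0.0,), 0, 0.0, (3.0,)),   # large state jump the toy model misses
    ((3.0,), 0, 1.0, (3.0,)),   # reward the toy model misses
    ((3.0,), 0, 0.0, (3.0,)),
]
start, prompt_segment = segment_with_largest_error(toy_trajectory, toy_model, 2)
```

The selected segment covers exactly the steps the world model explains worst, which matches the stated aim of making the prompt complementary to the world model.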

Spatio-Temporal Distortion Aware Omnidirectional Video Super-Resolution

Oct 15, 2024

Hongyu An, Xinfeng Zhang, Li Zhang, Ruiqin Xiong

Abstract: Omnidirectional video (ODV) can provide an immersive experience and is widely utilized in the field of virtual reality and augmented reality. However, the restricted capturing devices and transmission bandwidth lead to the low resolution of ODVs. Video super-resolution (VSR) methods are proposed to enhance the resolution of videos, but ODV projection distortions are not well addressed when such methods are applied directly. To achieve better super-resolution reconstruction quality, we propose a novel Spatio-Temporal Distortion Aware Network (STDAN) oriented to ODV characteristics. Specifically, a spatio-temporal distortion modulation module is introduced to mitigate spatial ODV projection distortions and exploit the temporal correlation according to intra and inter alignments. Next, we design a multi-frame reconstruction and fusion mechanism to refine the consistency of reconstructed ODV frames. Furthermore, we incorporate latitude-saliency adaptive maps in the loss function to concentrate on important viewpoint regions with higher texture complexity and human-watching interest. In addition, we collect a new ODV-SR dataset with various scenarios. Extensive experimental results demonstrate that the proposed STDAN achieves superior super-resolution performance on ODVs and outperforms state-of-the-art methods.
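
To make the idea of latitude-dependent loss weighting concrete, here is a minimal sketch. It assumes equirectangular frames and uses the cosine-of-latitude row weight familiar from WS-PSNR, so that over-sampled polar rows count less than equatorial rows. The saliency component of the paper's latitude-saliency maps is not modeled here, and the function names are illustrative only.

```python
import math


def latitude_weights(height):
    """Cosine-of-latitude weight for each row of an equirectangular frame."""
    return [
        math.cos((row + 0.5 - height / 2) * math.pi / height)
        for row in range(height)
    ]


def latitude_weighted_l1(pred, target):
    """Weighted mean absolute error between two frames given as lists of rows."""
    weights = latitude_weights(len(pred))
    weighted_error = 0.0
    weight_total = 0.0
    for w, pred_row, target_row in zip(weights, pred, target):
        for p, t in zip(pred_row, target_row):
            weighted_error += w * abs(p - t)
            weight_total += w
    return weighted_error / weight_total


# A 4-row, 1-column toy frame: the same error costs more near the equator.
reference = [[0.0], [0.0], [0.0], [0.0]]
error_near_pole = [[1.0], [0.0], [0.0], [0.0]]
error_near_equator = [[0.0], [1.0], [0.0], [0.0]]
loss_pole = latitude_weighted_l1(error_near_pole, reference)
loss_equator = latitude_weighted_l1(error_near_equator, reference)
```

Multiplying such a latitude map by a per-pixel saliency map would be one way to obtain a combined weighting, but how the paper combines the two is not stated in the abstract.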

AM-SAM: Automated Prompting and Mask Calibration for Segment Anything Model

Oct 13, 2024

Yuchen Li, Li Zhang, Youwei Liang, Pengtao Xie

Abstract: Segment Anything Model (SAM) has gained significant recognition in the field of semantic segmentation due to its versatile capabilities and impressive performance. Despite its success, SAM faces two primary limitations: (1) it relies heavily on meticulous human-provided prompts like key points, bounding boxes or text messages, which is labor-intensive; (2) the mask decoder's feature representation is sometimes inaccurate, as it solely employs dot product operations at the end of the mask decoder, which inadequately captures the necessary correlations for precise segmentation. Current solutions to these problems, such as fine-tuning SAM, often require retraining a large number of parameters, which requires a huge amount of time and computing resources. To address these limitations, we propose an automated prompting and mask calibration method called AM-SAM based on a bi-level optimization framework. Our approach automatically generates prompts for an input image, eliminating the need for human involvement while performing well in early training epochs and converging faster. Additionally, we freeze the main part of SAM, and modify the mask decoder with Low-Rank Adaptation (LoRA), enhancing the mask decoder's feature representation by incorporating advanced techniques that go beyond simple dot product operations to more accurately capture and utilize feature correlations. Our experimental results demonstrate that AM-SAM achieves highly accurate segmentation, matching or exceeding the effectiveness of human-generated and default prompts. Notably, on the body segmentation dataset, our method yields a 5% higher dice score with a 4-example few-shot training set compared to the SOTA method, underscoring its superiority in semantic segmentation tasks.
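
The Low-Rank Adaptation idea referenced above can be shown with a small, dependency-free sketch: the pretrained weight stays frozen and only two thin matrices are trained, so the layer computes W x + (alpha / r) B A x. The class below is an illustrative toy using plain Python lists, not the AM-SAM implementation; where in the mask decoder LoRA is applied is not specified in the abstract.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Linear layer with a frozen weight and a trainable low-rank update."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        self.weight = weight                      # frozen, shape (out_dim, in_dim)
        self.out_dim = len(weight)
        self.in_dim = len(weight[0])
        self.rank = rank
        self.scale = alpha / rank
        rng = random.Random(seed)
        # Usual LoRA initialization: A small random, B zero, so the layer
        # starts out identical to the frozen pretrained layer.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(self.in_dim)]
                  for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(self.out_dim)]

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_parameter_count(self):
        return self.rank * (self.in_dim + self.out_dim)


frozen_weight = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]
layer = LoRALinear(frozen_weight, rank=1, alpha=2.0)
initial_output = layer.forward([1.0, 1.0, 1.0])   # equals the frozen layer's output
```

With rank 1 the toy layer trains 5 parameters instead of the 6 in the full weight; the saving grows quickly for realistic layer sizes, which is the motivation for avoiding full fine-tuning.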

ECVC: Exploiting Non-Local Correlations in Multiple Frames for Contextual Video Compression

Oct 13, 2024

Wei Jiang, Junru Li, Kai Zhang, Li Zhang

Abstract: In Learned Video Compression (LVC), improving inter prediction, such as enhancing temporal context mining and mitigating accumulated errors, is crucial for boosting rate-distortion performance. Existing LVCs mainly focus on mining the temporal movements within adjacent frames, neglecting non-local correlations among frames. Additionally, current contextual video compression models use a single reference frame, which is insufficient for handling complex movements. To address these issues, we propose leveraging non-local correlations across multiple frames to enhance temporal priors, significantly boosting rate-distortion performance. To mitigate error accumulation, we introduce a partial cascaded fine-tuning strategy that supports fine-tuning on full-length sequences with constrained computational resources. This method reduces the train-test mismatch in sequence lengths and significantly decreases accumulated errors. Based on the proposed techniques, we present a video compression scheme, ECVC. Experiments demonstrate that our ECVC achieves state-of-the-art performance, reducing 7.3% and 10.5% more bit-rate than DCVC-DC and DCVC-FM over VTM-13.2 low delay B (LDB), respectively, when the intra period (IP) is 32. Additionally, ECVC reduces 11.1% more bit-rate than DCVC-FM over VTM-13.2 LDB when the IP is -1. Our code will be available at https://github.com/JiangWeibeta/ECVC.

* Code will be available at https://github.com/JiangWeibeta/ECVC

DeMo: Decoupling Motion Forecasting into Directional Intentions and Dynamic States

Oct 08, 2024

Bozhou Zhang, Nan Song, Li Zhang

Abstract: Accurate motion forecasting for traffic agents is crucial for ensuring the safety and efficiency of autonomous driving systems in dynamically changing environments. Mainstream methods adopt a one-query-one-trajectory paradigm, where each query corresponds to a unique trajectory for predicting multi-modal trajectories. While straightforward and effective, the absence of detailed representation of future trajectories may yield suboptimal outcomes, given that the agent states dynamically evolve over time. To address this problem, we introduce DeMo, a framework that decouples multi-modal trajectory queries into two types: mode queries capturing distinct directional intentions and state queries tracking the agent's dynamic states over time. By leveraging this format, we separately optimize the multi-modality and dynamic evolutionary properties of trajectories. Subsequently, the mode and state queries are integrated to obtain a comprehensive and detailed representation of the trajectories. To achieve these operations, we additionally introduce combined Attention and Mamba techniques for global information aggregation and state sequence modeling, leveraging their respective strengths. Extensive experiments on both the Argoverse 2 and nuScenes benchmarks demonstrate that our DeMo achieves state-of-the-art performance in motion forecasting.

* NeurIPS 2024

Motion Forecasting in Continuous Driving

Oct 08, 2024

Nan Song, Bozhou Zhang, Xiatian Zhu, Li Zhang

Abstract: Motion forecasting for agents in autonomous driving is highly challenging due to the numerous possibilities for each agent's next action and their complex interactions in space and time. In real applications, motion forecasting takes place repeatedly and continuously as the self-driving car moves. However, existing forecasting methods typically process each driving scene within a certain range independently, totally ignoring the situational and contextual relationships between successive driving scenes. This significantly simplifies the forecasting task, making the solutions suboptimal and inefficient to use in practice. To address this fundamental limitation, we propose a novel motion forecasting framework for continuous driving, named RealMotion. It comprises two integral streams both at the scene level: (1) The scene context stream progressively accumulates historical scene information until the present moment, capturing temporal interactive relationships among scene elements. (2) The agent trajectory stream optimizes current forecasting by sequentially relaying past predictions. Besides, a data reorganization strategy is introduced to narrow the gap between existing benchmarks and real-world applications, consistent with our network. These approaches enable exploiting more broadly the situational and progressive insights of dynamic motion across space and time. Extensive experiments on Argoverse series with different settings demonstrate that our RealMotion achieves state-of-the-art performance, along with the advantage of efficient real-world inference. The source code will be available at https://github.com/fudan-zvg/RealMotion.

* Accepted at NeurIPS 2024 Spotlight

Metadata Matters for Time Series: Informative Forecasting with Transformers

Oct 04, 2024

Jiaxiang Dong, Haixu Wu, Yuxuan Wang, Li Zhang, Jianmin Wang, Mingsheng Long

Abstract: Time series forecasting is prevalent in extensive real-world applications, such as financial analysis and energy planning. Previous studies primarily focus on time series modality, endeavoring to capture the intricate variations and dependencies inherent in time series. Beyond numerical time series data, we notice that metadata (e.g., dataset and variate descriptions) also carries valuable information essential for forecasting, which can be used to identify the application scenario and provide more interpretable knowledge than digit sequences. Inspired by this observation, we propose a Metadata-informed Time Series Transformer (MetaTST), which incorporates multiple levels of context-specific metadata into Transformer forecasting models to enable informative time series forecasting. To tackle the unstructured nature of metadata, MetaTST formalizes it into natural language via pre-designed templates and leverages large language models (LLMs) to encode these texts into metadata tokens as a supplement to classic series tokens, resulting in an informative embedding. Further, a Transformer encoder is employed to communicate series and metadata tokens, which can extend series representations by metadata information for more accurate forecasting. This design also allows the model to adaptively learn context-specific patterns across various scenarios, which is particularly effective in handling large-scale, diverse-scenario forecasting tasks. Experimentally, MetaTST achieves state-of-the-art performance compared to advanced time series models and LLM-based methods on widely acknowledged short- and long-term forecasting benchmarks, covering both single-dataset individual and multi-dataset joint training settings.
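
The template step described in the abstract (turning unstructured metadata into natural language before it is encoded) can be sketched as plain string formatting. The template wording and the field names below are hypothetical examples, not the templates used by MetaTST, and the subsequent LLM encoding into metadata tokens is deliberately left out.

```python
# Hypothetical template; the paper's actual pre-designed templates are not given here.
METADATA_TEMPLATE = (
    "Dataset: {dataset}. Variate: {variate}. "
    "Sampling frequency: {frequency}. "
    "Task: forecast the next {horizon} steps."
)


def metadata_to_text(metadata):
    """Render a metadata dict into one natural-language sentence block.

    Raises KeyError if a field required by the template is missing, so that
    incomplete metadata is caught before it reaches any text encoder.
    """
    return METADATA_TEMPLATE.format(**metadata)


example_metadata = {
    "dataset": "hourly electricity load",
    "variate": "transformer oil temperature",
    "frequency": "1 hour",
    "horizon": 96,
}
metadata_text = metadata_to_text(example_metadata)
```

In a full pipeline the resulting text would be passed to a language model to produce metadata tokens that accompany the series tokens; that stage depends on the chosen model and is not shown.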
