Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Zirui Wang

EEG-ReMinD: Enhancing Neurodegenerative EEG Decoding through Self-Supervised State Reconstruction-Primed Riemannian Dynamics

Jan 14, 2025

Zirui Wang, Zhenxi Song, Yi Guo, Yuxin Liu, Guoyang Xu, Min Zhang, Zhiguo Zhang

Figure 1 for EEG-ReMinD: Enhancing Neurodegenerative EEG Decoding through Self-Supervised State Reconstruction-Primed Riemannian Dynamics

Figure 2 for EEG-ReMinD: Enhancing Neurodegenerative EEG Decoding through Self-Supervised State Reconstruction-Primed Riemannian Dynamics

Figure 3 for EEG-ReMinD: Enhancing Neurodegenerative EEG Decoding through Self-Supervised State Reconstruction-Primed Riemannian Dynamics

Figure 4 for EEG-ReMinD: Enhancing Neurodegenerative EEG Decoding through Self-Supervised State Reconstruction-Primed Riemannian Dynamics

Abstract:The development of EEG decoding algorithms confronts challenges such as data sparsity, subject variability, and the need for precise annotations, all of which are vital for advancing brain-computer interfaces and enhancing the diagnosis of diseases. To address these issues, we propose a novel two-stage approach named Self-Supervised State Reconstruction-Primed Riemannian Dynamics (EEG-ReMinD) , which mitigates reliance on supervised learning and integrates inherent geometric features. This approach efficiently handles EEG data corruptions and reduces the dependency on labels. EEG-ReMinD utilizes self-supervised and geometric learning techniques, along with an attention mechanism, to analyze the temporal dynamics of EEG features within the framework of Riemannian geometry, referred to as Riemannian dynamics. Comparative analyses on both intact and corrupted datasets from two different neurodegenerative disorders underscore the enhanced performance of EEG-ReMinD.

Via

Access Paper or Ask Questions

SMAC-Hard: Enabling Mixed Opponent Strategy Script and Self-play on SMAC

Dec 24, 2024

Yue Deng, Yan Yu, Weiyu Ma, Zirui Wang, Wenhui Zhu, Jian Zhao, Yin Zhang

Abstract:The availability of challenging simulation environments is pivotal for advancing the field of Multi-Agent Reinforcement Learning (MARL). In cooperative MARL settings, the StarCraft Multi-Agent Challenge (SMAC) has gained prominence as a benchmark for algorithms following centralized training with decentralized execution paradigm. However, with continual advancements in SMAC, many algorithms now exhibit near-optimal performance, complicating the evaluation of their true effectiveness. To alleviate this problem, in this work, we highlight a critical issue: the default opponent policy in these environments lacks sufficient diversity, leading MARL algorithms to overfit and exploit unintended vulnerabilities rather than learning robust strategies. To overcome these limitations, we propose SMAC-HARD, a novel benchmark designed to enhance training robustness and evaluation comprehensiveness. SMAC-HARD supports customizable opponent strategies, randomization of adversarial policies, and interfaces for MARL self-play, enabling agents to generalize to varying opponent behaviors and improve model stability. Furthermore, we introduce a black-box testing framework wherein agents are trained without exposure to the edited opponent scripts but are tested against these scripts to evaluate the policy coverage and adaptability of MARL algorithms. We conduct extensive evaluations of widely used and state-of-the-art algorithms on SMAC-HARD, revealing the substantial challenges posed by edited and mixed strategy opponents. Additionally, the black-box strategy tests illustrate the difficulty of transferring learned policies to unseen adversaries. We envision SMAC-HARD as a critical step toward benchmarking the next generation of MARL algorithms, fostering progress in self-play methods for multi-agent systems. Our code is available at https://github.com/devindeng94/smac-hard.

Via

Access Paper or Ask Questions

Learning Humanoid Locomotion with Perceptive Internal Model

Nov 21, 2024

Junfeng Long, Junli Ren, Moji Shi, Zirui Wang, Tao Huang, Ping Luo, Jiangmiao Pang

Figure 1 for Learning Humanoid Locomotion with Perceptive Internal Model

Figure 2 for Learning Humanoid Locomotion with Perceptive Internal Model

Figure 3 for Learning Humanoid Locomotion with Perceptive Internal Model

Figure 4 for Learning Humanoid Locomotion with Perceptive Internal Model

Abstract:In contrast to quadruped robots that can navigate diverse terrains using a "blind" policy, humanoid robots require accurate perception for stable locomotion due to their high degrees of freedom and inherently unstable morphology. However, incorporating perceptual signals often introduces additional disturbances to the system, potentially reducing its robustness, generalizability, and efficiency. This paper presents the Perceptive Internal Model (PIM), which relies on onboard, continuously updated elevation maps centered around the robot to perceive its surroundings. We train the policy using ground-truth obstacle heights surrounding the robot in simulation, optimizing it based on the Hybrid Internal Model (HIM), and perform inference with heights sampled from the constructed elevation map. Unlike previous methods that directly encode depth maps or raw point clouds, our approach allows the robot to perceive the terrain beneath its feet clearly and is less affected by camera movement or noise. Furthermore, since depth map rendering is not required in simulation, our method introduces minimal additional computational costs and can train the policy in 3 hours on an RTX 4090 GPU. We verify the effectiveness of our method across various humanoid robots, various indoor and outdoor terrains, stairs, and various sensor configurations. Our method can enable a humanoid robot to continuously climb stairs and has the potential to serve as a foundational algorithm for the development of future humanoid control methods.

* submitted to ICRA2025

Via

Access Paper or Ask Questions

HiMemFormer: Hierarchical Memory-Aware Transformer for Multi-Agent Action Anticipation

Nov 03, 2024

Zirui Wang, Xinran Zhao, Simon Stepputtis, Woojun Kim, Tongshuang Wu, Katia Sycara, Yaqi Xie

Figure 1 for HiMemFormer: Hierarchical Memory-Aware Transformer for Multi-Agent Action Anticipation

Figure 2 for HiMemFormer: Hierarchical Memory-Aware Transformer for Multi-Agent Action Anticipation

Figure 3 for HiMemFormer: Hierarchical Memory-Aware Transformer for Multi-Agent Action Anticipation

Figure 4 for HiMemFormer: Hierarchical Memory-Aware Transformer for Multi-Agent Action Anticipation

Abstract:Understanding and predicting human actions has been a long-standing challenge and is a crucial measure of perception in robotics AI. While significant progress has been made in anticipating the future actions of individual agents, prior work has largely overlooked a key aspect of real-world human activity -- interactions. To address this gap in human-like forecasting within multi-agent environments, we present the Hierarchical Memory-Aware Transformer (HiMemFormer), a transformer-based model for online multi-agent action anticipation. HiMemFormer integrates and distributes global memory that captures joint historical information across all agents through a transformer framework, with a hierarchical local memory decoder that interprets agent-specific features based on these global representations using a coarse-to-fine strategy. In contrast to previous approaches, HiMemFormer uniquely hierarchically applies the global context with agent-specific preferences to avoid noisy or redundant information in multi-agent action anticipation. Extensive experiments on various multi-agent scenarios demonstrate the significant performance of HiMemFormer, compared with other state-of-the-art methods.

Via

Access Paper or Ask Questions

MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning

Sep 30, 2024

Haotian Zhang, Mingfei Gao, Zhe Gan, Philipp Dufter, Nina Wenzel, Forrest Huang, Dhruti Shah, Xianzhi Du, Bowen Zhang, Yanghao Li(+13 more)

Figure 1 for MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning

Figure 2 for MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning

Figure 3 for MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning

Figure 4 for MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning

Abstract:We present MM1.5, a new family of multimodal large language models (MLLMs) designed to enhance capabilities in text-rich image understanding, visual referring and grounding, and multi-image reasoning. Building upon the MM1 architecture, MM1.5 adopts a data-centric approach to model training, systematically exploring the impact of diverse data mixtures across the entire model training lifecycle. This includes high-quality OCR data and synthetic captions for continual pre-training, as well as an optimized visual instruction-tuning data mixture for supervised fine-tuning. Our models range from 1B to 30B parameters, encompassing both dense and mixture-of-experts (MoE) variants, and demonstrate that careful data curation and training strategies can yield strong performance even at small scales (1B and 3B). Additionally, we introduce two specialized variants: MM1.5-Video, designed for video understanding, and MM1.5-UI, tailored for mobile UI understanding. Through extensive empirical studies and ablations, we provide detailed insights into the training processes and decisions that inform our final designs, offering valuable guidance for future research in MLLM development.

Via

Access Paper or Ask Questions

Quantum Machine Learning for Semiconductor Fabrication: Modeling GaN HEMT Contact Process

Sep 17, 2024

Zeheng Wang, Fangzhou Wang, Liang Li, Zirui Wang, Timothy van der Laan, Ross C. C. Leon, Jing-Kai Huang, Muhammad Usman

Figure 1 for Quantum Machine Learning for Semiconductor Fabrication: Modeling GaN HEMT Contact Process

Figure 2 for Quantum Machine Learning for Semiconductor Fabrication: Modeling GaN HEMT Contact Process

Figure 3 for Quantum Machine Learning for Semiconductor Fabrication: Modeling GaN HEMT Contact Process

Figure 4 for Quantum Machine Learning for Semiconductor Fabrication: Modeling GaN HEMT Contact Process

Abstract:This paper pioneers the use of quantum machine learning (QML) for modeling the Ohmic contact process in GaN high-electron-mobility transistors (HEMTs) for the first time. Utilizing data from 159 devices and variational auto-encoder-based augmentation, we developed a quantum kernel-based regressor (QKR) with a 2-level ZZ-feature map. Benchmarking against six classical machine learning (CML) models, our QKR consistently demonstrated the lowest mean absolute error (MAE), mean squared error (MSE), and root mean squared error (RMSE). Repeated statistical analysis confirmed its robustness. Additionally, experiments verified an MAE of 0.314 ohm-mm, underscoring the QKR's superior performance and potential for semiconductor applications, and demonstrating significant advancements over traditional CML methods.

* This is the manuscript in the conference version. An expanded version for the journal will be released later and more information will be added. The author list, content, conclusion, and figures may change due to further research

Via

Access Paper or Ask Questions

Semantic-Guided Multimodal Sentiment Decoding with Adversarial Temporal-Invariant Learning

Sep 11, 2024

Guoyang Xu, Junqi Xue, Yuxin Liu, Zirui Wang, Min Zhang, Zhenxi Song, Zhiguo Zhang

Figure 1 for Semantic-Guided Multimodal Sentiment Decoding with Adversarial Temporal-Invariant Learning

Figure 2 for Semantic-Guided Multimodal Sentiment Decoding with Adversarial Temporal-Invariant Learning

Figure 3 for Semantic-Guided Multimodal Sentiment Decoding with Adversarial Temporal-Invariant Learning

Figure 4 for Semantic-Guided Multimodal Sentiment Decoding with Adversarial Temporal-Invariant Learning

Abstract:Multimodal sentiment analysis aims to learn representations from different modalities to identify human emotions. However, existing works often neglect the frame-level redundancy inherent in continuous time series, resulting in incomplete modality representations with noise. To address this issue, we propose temporal-invariant learning for the first time, which constrains the distributional variations over time steps to effectively capture long-term temporal dynamics, thus enhancing the quality of the representations and the robustness of the model. To fully exploit the rich semantic information in textual knowledge, we propose a semantic-guided fusion module. By evaluating the correlations between different modalities, this module facilitates cross-modal interactions gated by modality-invariant representations. Furthermore, we introduce a modality discriminator to disentangle modality-invariant and modality-specific subspaces. Experimental results on two public datasets demonstrate the superiority of our model. Our code is available at https://github.com/X-G-Y/SATI.

* change Title, Authors, Abstract

Via

Access Paper or Ask Questions

Robust Temporal-Invariant Learning in Multimodal Disentanglement

Aug 30, 2024

Guoyang Xu, Junqi Xue, Zhenxi Song, Yuxin Liu, Zirui Wang, Min Zhang, Zhiguo Zhang

Figure 1 for Robust Temporal-Invariant Learning in Multimodal Disentanglement

Figure 2 for Robust Temporal-Invariant Learning in Multimodal Disentanglement

Figure 3 for Robust Temporal-Invariant Learning in Multimodal Disentanglement

Figure 4 for Robust Temporal-Invariant Learning in Multimodal Disentanglement

Abstract:Multimodal sentiment recognition aims to learn representations from different modalities to identify human emotions. However, previous works does not suppresses the frame-level redundancy inherent in continuous time series, resulting in incomplete modality representations with noise. To address this issue, we propose the Temporal-invariant learning, which minimizes the distributional differences between time steps to effectively capture smoother time series patterns, thereby enhancing the quality of the representations and robustness of the model. To fully exploit the rich semantic information in textual knowledge, we propose a Text-Driven Fusion Module (TDFM). To guide cross-modal interactions, TDFM evaluates the correlations between different modality through modality-invariant representations. Furthermore, we introduce a modality discriminator to disentangle modality-invariant and modality-specific subspaces. Experimental results on two public datasets demonstrate the superiority of our model.

* 5 pages, 2 figures, this is the first version. The code is available at https://github.com/X-G-Y/RTIL

Via

Access Paper or Ask Questions

CatFree3D: Category-agnostic 3D Object Detection with Diffusion

Aug 22, 2024

Wenjing Bian, Zirui Wang, Andrea Vedaldi

Figure 1 for CatFree3D: Category-agnostic 3D Object Detection with Diffusion

Figure 2 for CatFree3D: Category-agnostic 3D Object Detection with Diffusion

Figure 3 for CatFree3D: Category-agnostic 3D Object Detection with Diffusion

Figure 4 for CatFree3D: Category-agnostic 3D Object Detection with Diffusion

Abstract:Image-based 3D object detection is widely employed in applications such as autonomous vehicles and robotics, yet current systems struggle with generalisation due to complex problem setup and limited training data. We introduce a novel pipeline that decouples 3D detection from 2D detection and depth prediction, using a diffusion-based approach to improve accuracy and support category-agnostic detection. Additionally, we introduce the Normalised Hungarian Distance (NHD) metric for an accurate evaluation of 3D detection results, addressing the limitations of traditional IoU and GIoU metrics. Experimental results demonstrate that our method achieves state-of-the-art accuracy and strong generalisation across various object categories and datasets.

* Project page: https://bianwenjing.github.io/CatFree3D

Via

Access Paper or Ask Questions

GSLoc: Efficient Camera Pose Refinement via 3D Gaussian Splatting

Aug 20, 2024

Changkun Liu, Shuai Chen, Yash Bhalgat, Siyan Hu, Zirui Wang, Ming Cheng, Victor Adrian Prisacariu, Tristan Braud

Figure 1 for GSLoc: Efficient Camera Pose Refinement via 3D Gaussian Splatting

Figure 2 for GSLoc: Efficient Camera Pose Refinement via 3D Gaussian Splatting

Figure 3 for GSLoc: Efficient Camera Pose Refinement via 3D Gaussian Splatting

Figure 4 for GSLoc: Efficient Camera Pose Refinement via 3D Gaussian Splatting

Abstract:We leverage 3D Gaussian Splatting (3DGS) as a scene representation and propose a novel test-time camera pose refinement framework, GSLoc. This framework enhances the localization accuracy of state-of-the-art absolute pose regression and scene coordinate regression methods. The 3DGS model renders high-quality synthetic images and depth maps to facilitate the establishment of 2D-3D correspondences. GSLoc obviates the need for training feature extractors or descriptors by operating directly on RGB images, utilizing the 3D vision foundation model, MASt3R, for precise 2D matching. To improve the robustness of our model in challenging outdoor environments, we incorporate an exposure-adaptive module within the 3DGS framework. Consequently, GSLoc enables efficient pose refinement given a single RGB query and a coarse initial pose estimation. Our proposed approach surpasses leading NeRF-based optimization methods in both accuracy and runtime across indoor and outdoor visual localization benchmarks, achieving state-of-the-art accuracy on two indoor datasets.

* The project page is available at https://gsloc.active.vision

Via

Access Paper or Ask Questions