Michael
Abstract:Artificial Intelligence (AI) has shown great promise in electrocardiogram (ECG) analysis and cardiovascular disease detection. However, developing a general AI-ECG model has been challenging due to inter-individual variability and the diversity of ECG diagnoses, limiting existing models to specific diagnostic tasks and datasets. Moreover, current AI-ECG models struggle to achieve comparable performance between single-lead and 12-lead ECGs, limiting the application of AI-ECG to portable and wearable ECG devices. To address these limitations, we introduce an ECG Foundation Model (ECGFounder), a general-purpose model that leverages real-world ECG annotations from cardiology experts to broaden the diagnostic capabilities of ECG analysis. ECGFounder is trained on over 10 million ECGs with 150 label categories from the Harvard-Emory ECG Database, enabling comprehensive cardiovascular disease diagnosis through ECG analysis. The model is designed to be both effective out-of-the-box and fine-tunable for downstream tasks, maximizing usability. More importantly, we extend its application to single-lead ECGs, enabling complex condition diagnoses and supporting various downstream tasks in mobile and remote monitoring scenarios. Experimental results demonstrate that ECGFounder achieves expert-level performance on internal validation sets for both 12-lead and single-lead ECGs, while also exhibiting strong classification performance and generalization across various diagnoses on external validation sets. When fine-tuned, ECGFounder outperforms baseline models in demographics detection, clinical event detection, and cross-modality cardiac rhythm diagnosis. The trained model and data will be publicly released upon publication through the bdsp.io. Our code is available at https://github.com/bdsp-core/ECGFounder.
Abstract:This paper presents a novel multi-stream downlink communication system that utilizes a transmissive reconfigurable intelligent surface (RIS) transceiver. Specifically, we elaborate the downlink communication scheme using time-modulated array (TMA) technology, which enables high order modulation and multi-stream beamforming. Then, an optimization problem is formulated to maximize the minimum signal-to-interference-plusnoise ratio (SINR) with user fairness, which takes into account the constraint of the maximum available power for each transmissive element. Due to the non-convex nature of the formulated problem,finding optimal solution is challenging. To mitigate the complexity,we propose a linear-complexity beamforming algorithm based on consensus alternating direction method of multipliers (ADMM).Specifically, by introducing a set of auxiliary variables, the problem can be decomposed into multiple sub-problems that are amenable to parallel computation, where the each sub-problem can yield closed-form expressions, bringing a significant reduction in the computational complexity. The overall problem achieves convergence by iteratively addressing these sub-problems in an alternating manner. Finally, the convergence of the proposed algorithm and the impact of various parameter configurations on the system performance are validated through numerical simulations.




Abstract:Advancements in Multimodal Large Language Models (MLLMs) have significantly improved medical task performance, such as Visual Question Answering (VQA) and Report Generation (RG). However, the fairness of these models across diverse demographic groups remains underexplored, despite its importance in healthcare. This oversight is partly due to the lack of demographic diversity in existing medical multimodal datasets, which complicates the evaluation of fairness. In response, we propose FMBench, the first benchmark designed to evaluate the fairness of MLLMs performance across diverse demographic attributes. FMBench has the following key features: 1: It includes four demographic attributes: race, ethnicity, language, and gender, across two tasks, VQA and RG, under zero-shot settings. 2: Our VQA task is free-form, enhancing real-world applicability and mitigating the biases associated with predefined choices. 3: We utilize both lexical metrics and LLM-based metrics, aligned with clinical evaluations, to assess models not only for linguistic accuracy but also from a clinical perspective. Furthermore, we introduce a new metric, Fairness-Aware Performance (FAP), to evaluate how fairly MLLMs perform across various demographic attributes. We thoroughly evaluate the performance and fairness of eight state-of-the-art open-source MLLMs, including both general and medical MLLMs, ranging from 7B to 26B parameters on the proposed benchmark. We aim for FMBench to assist the research community in refining model evaluation and driving future advancements in the field. All data and code will be released upon acceptance.
Abstract:Reconstructing 3D coronary arteries is important for coronary artery disease diagnosis, treatment planning and operation navigation. Traditional reconstruction techniques often require many projections, while reconstruction from sparse-view X-ray projections is a potential way of reducing radiation dose. However, the extreme sparsity of coronary arteries in a 3D volume and ultra-limited number of projections pose significant challenges for efficient and accurate 3D reconstruction. To this end, we propose 3DGR-CAR, a 3D Gaussian Representation for Coronary Artery Reconstruction from ultra-sparse X-ray projections. We leverage 3D Gaussian representation to avoid the inefficiency caused by the extreme sparsity of coronary artery data and propose a Gaussian center predictor to overcome the noisy Gaussian initialization from ultra-sparse view projections. The proposed scheme enables fast and accurate 3D coronary artery reconstruction with only 2 views. Experimental results on two datasets indicate that the proposed approach significantly outperforms other methods in terms of voxel accuracy and visual quality of coronary arteries. The code will be available in https://github.com/windrise/3DGR-CAR.




Abstract:This paper introduces the model structure used in the SVDD 2024 Challenge. The SVDD 2024 challenge has been introduced this year for the first time. Singing voice deepfake detection (SVDD) which faces complexities due to informal speech intonations and varying speech rates. In this paper, we propose the XWSB system, which achieved SOTA per-formance in the SVDD challenge. XWSB stands for XLS-R, WavLM, and SLS Blend, representing the integration of these technologies for the purpose of SVDD. Specifically, we used the best performing model structure XLS-R&SLS from the ASVspoof DF dataset, and applied SLS to WavLM to form the WavLM&SLS structure. Finally, we integrated two models to form the XWSB system. Experimental results show that our system demonstrates advanced recognition capabilities in the SVDD challenge, specifically achieving an EER of 2.32% in the CtrSVDD track. The code and data can be found at https://github.com/QiShanZhang/XWSB_for_ SVDD2024.




Abstract:With the increasing complexity of the traffic environment, the significance of safety perception in intelligent driving is intensifying. Traditional methods in the field of intelligent driving perception rely on deep learning, which suffers from limited interpretability, often described as a "black box." This paper introduces a novel type of loss function, termed "Entropy Loss," along with an innovative training strategy. Entropy Loss is formulated based on the functionality of feature compression networks within the perception model. Drawing inspiration from communication systems, the information transmission process in a feature compression network is expected to demonstrate steady changes in information volume and a continuous decrease in information entropy. By modeling network layer outputs as continuous random variables, we construct a probabilistic model that quantifies changes in information volume. Entropy Loss is then derived based on these expectations, guiding the update of network parameters to enhance network interpretability. Our experiments indicate that the Entropy Loss training strategy accelerates the training process. Utilizing the same 60 training epochs, the accuracy of 3D object detection models using Entropy Loss on the KITTI test set improved by up to 4.47\% compared to models without Entropy Loss, underscoring the method's efficacy. The implementation code is available at \url{https://github.com/yhbcode000/Eloss-Interpretability}.




Abstract:With the continuous development of deep learning, the field of repetitive action counting is gradually gaining notice from many researchers. Extraction of pose keypoints using human pose estimation networks is proven to be an effective pose-level method. However, existing pose-level methods suffer from the shortcomings that the single coordinate is not stable enough to handle action distortions due to changes in camera viewpoints, thus failing to accurately identify salient poses, and is vulnerable to misdetection during the transition from the exception to the actual action. To overcome these problems, we propose a simple but efficient Global Multi-geometric Feature Learning Network (GMFL-Net). Specifically, we design a MIA-Module that aims to improve information representation by fusing multi-geometric features, and learning the semantic similarity among the input multi-geometric features. Then, to improve the feature representation from a global perspective, we also design a GBFL-Module that enhances the inter-dependencies between point-wise and channel-wise elements and combines them with the rich local information generated by the MIA-Module to synthesise a comprehensive and most representative global feature representation. In addition, considering the insufficient existing dataset, we collect a new dataset called Countix-Fitness-pose (https://github.com/Wantong66/Countix-Fitness) which contains different cycle lengths and exceptions, a test set with longer duration, and annotate it with fine-grained annotations at the pose-level. We also add two new action classes, namely lunge and rope push-down. Finally, extensive experiments on the challenging RepCount-pose, UCFRep-pose, and Countix-Fitness-pose benchmarks show that our proposed GMFL-Net achieves state-of-the-art performance.




Abstract:Enhancing the safety of autonomous vehicles is crucial, especially given recent accidents involving automated systems. As passengers in these vehicles, humans' sensory perception and decision-making can be integrated with autonomous systems to improve safety. This study explores neural mechanisms in passenger-vehicle interactions, leading to the development of a Passenger Cognitive Model (PCM) and the Passenger EEG Decoding Strategy (PEDS). Central to PEDS is a novel Convolutional Recurrent Neural Network (CRNN) that captures spatial and temporal EEG data patterns. The CRNN, combined with stacking algorithms, achieves an accuracy of $85.0\% \pm 3.18\%$. Our findings highlight the predictive power of pre-event EEG data, enhancing the detection of hazardous scenarios and offering a network-driven framework for safer autonomous vehicles.




Abstract:Reconfigurable intelligent surface (RIS) is anticipated to augment the performance of beyond fifth-generation (B5G) and sixth-generation (6G) networks by intelligently manipulating the state of its components. Rather than employing reflective RIS for aided communications, this paper proposes an innovative transmissive RIS-enabled transceiver (TRTC) architecture that can accomplish the functions of traditional multi-antenna systems in a cost-effective and energy-efficient manner. First, the proposed network architecture and its corresponding transmission scheme are elaborated from the perspectives of downlink (DL) and uplink (UL) transmissions. Then, we illustrate several significant advantages and differences of TRTC compared to other multiantenna systems. Furthermore, the downlink modulation and extraction principle based on time-modulation array (TMA) is introduced in detail to tackle the multi-stream communications. Moreover, a near-far field channel model appropriate for this architecture is proposed. Based on the channel model, we summarize some state-of-the-art channel estimation schemes, and the channel estimation scheme of TRTC is also provided. Considering the optimization for DL and UL communications, we present numerical simulations that confirm the superiority of the proposed optimization algorithm. Lastly, numerous prospective research avenues for TRTC systems are delineated to inspire further exploration.




Abstract:In recent years, video action recognition, as a fundamental task in the field of video understanding, has been deeply explored by numerous researchers.Most traditional video action recognition methods typically involve converting videos into three-dimensional data that encapsulates both spatial and temporal information, subsequently leveraging prevalent image understanding models to model and analyze these data. However,these methods have significant drawbacks. Firstly, when delving into video action recognition tasks, image understanding models often need to be adapted accordingly in terms of model architecture and preprocessing for these spatiotemporal tasks; Secondly, dealing with high-dimensional data often poses greater challenges and incurs higher time costs compared to its lower-dimensional counterparts.To bridge the gap between image-understanding and video-understanding tasks while simplifying the complexity of video comprehension, we introduce a novel video representation architecture, Flatten, which serves as a plug-and-play module that can be seamlessly integrated into any image-understanding network for efficient and effective 3D temporal data modeling.Specifically, by applying specific flattening operations (e.g., row-major transform), 3D spatiotemporal data is transformed into 2D spatial information, and then ordinary image understanding models are used to capture temporal dynamic and spatial semantic information, which in turn accomplishes effective and efficient video action recognition. Extensive experiments on commonly used datasets (Kinetics-400, Something-Something v2, and HMDB-51) and three classical image classification models (Uniformer, SwinV2, and ResNet), have demonstrated that embedding Flatten provides a significant performance improvements over original model.