Abstract:We present Xray-Visual, a unified vision model architecture for large-scale image and video understanding trained on industry-scale social media data. Our model leverages over 15 billion curated image-text pairs and 10 billion video-hashtag pairs from Facebook and Instagram, employing robust data curation pipelines that incorporate balancing and noise suppression strategies to maximize semantic diversity while minimizing label noise. We introduce a three-stage training pipeline that combines self-supervised MAE, semi-supervised hashtag classification, and CLIP-style contrastive learning to jointly optimize image and video modalities. Our architecture builds on a Vision Transformer backbone enhanced with efficient token reorganization (EViT) for improved computational efficiency. Extensive experiments demonstrate that Xray-Visual achieves state-of-the-art performance across diverse benchmarks, including ImageNet for image classification, Kinetics and HMDB51 for video understanding, and MSCOCO for cross-modal retrieval. The model exhibits strong robustness to domain shift and adversarial perturbations. We further demonstrate that integrating large language models as text encoders (LLM2CLIP) significantly enhances retrieval performance and generalization capabilities, particularly in real-world environments. Xray-Visual establishes new benchmarks for scalable, multimodal vision models, while maintaining superior accuracy and computational efficiency.




Abstract:The sympathetic nervous system (SNS) plays a central role in regulating the body's responses to stress and maintaining physiological stability. Its dysregulation is associated with a wide range of conditions, from cardiovascular disease to anxiety disorders. Skin nerve activity (SKNA) extracted from high-frequency electrocardiogram (ECG) recordings provides a noninvasive window into SNS dynamics, but its measurement is highly susceptible to electromyographic (EMG) contamination. Traditional preprocessing based on bandpass filtering within a fixed range (e.g., 500--1000 Hz) is susceptible to overlapping EMG and SKNA spectral components, especially during sustained muscle activity. We present a denoising approach using a lightweight one-dimensional convolutional autoencoder with a long short-term memory (LSTM) bottleneck to reconstruct clean SKNA from EMG-contaminated recordings. Using clean ECG-derived SKNA data from cognitive stress experiments and EMG noise from chaotic muscle stimulation recordings, we simulated contamination at realistic noise levels (--4 dB, --8 dB signal-to-noise ratio) and trained the model in the leave-one-subject-out cross-validation framework. The method improved signal-to-noise ratio by up to 9.65 dB, increased cross correlation with clean SKNA from 0.40 to 0.72, and restored burst-based SKNA features to near-clean discriminability (AUROC $\geq$ 0.96). Classification of baseline versus sympathetic stimulation (cognitive stress) conditions reached accuracies of 91--98\% across severe noise levels, comparable to clean data. These results demonstrate that deep learning--based reconstruction can preserve physiologically relevant sympathetic bursts during substantial EMG interference, enabling more robust SKNA monitoring in naturalistic, movement-rich environments.
Abstract:Most deep learning models of multiclass arrhythmia classification are tested on fingertip photoplethysmographic (PPG) data, which has higher signal-to-noise ratios compared to smartwatch-derived PPG, and the best reported sensitivity value for premature atrial/ventricular contraction (PAC/PVC) detection is only 75%. To improve upon PAC/PVC detection sensitivity while maintaining high AF detection, we use multi-modal data which incorporates 1D PPG, accelerometers, and heart rate data as the inputs to a computationally efficient 1D bi-directional Gated Recurrent Unit (1D-Bi-GRU) model to detect three arrhythmia classes. We used motion-artifact prone smartwatch PPG data from the NIH-funded Pulsewatch clinical trial. Our multimodal model tested on 72 subjects achieved an unprecedented 83% sensitivity for PAC/PVC detection while maintaining a high accuracy of 97.31% for AF detection. These results outperformed the best state-of-the-art model by 20.81% for PAC/PVC and 2.55% for AF detection even while our model was computationally more efficient (14 times lighter and 2.7 faster).