Recent video generation models largely rely on video autoencoders that compress pixel-space videos into latent representations. However, existing video autoencoders suffer from three major limitations: (1) fixed-rate compression that wastes tokens on simple videos, (2) inflexible CNN architectures that prevent variable-length latent modeling, and (3) deterministic decoders that struggle to recover appropriate details from compressed latents. To address these issues, we propose One-Dimensional Diffusion Video Autoencoder (One-DVA), a transformer-based framework for adaptive 1D encoding and diffusion-based decoding. The encoder employs query-based vision transformers to extract spatiotemporal features and produce latent representations, while a variable-length dropout mechanism dynamically adjusts the latent length. The decoder is a pixel-space diffusion transformer that reconstructs videos with the latents as input conditions. With a two-stage training strategy, One-DVA achieves performance comparable to 3D-CNN VAEs on reconstruction metrics at identical compression ratios. More importantly, it supports adaptive compression and thus can achieve higher compression ratios. To better support downstream latent generation, we further regularize the One-DVA latent distribution for generative modeling and fine-tune its decoder to mitigate artifacts caused by the generation process.
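
The abstract describes a query-based encoder and a variable-length dropout over 1D latents; the sketch below illustrates those two ideas in PyTorch under assumed sizes (latent width, query count, a single cross-attention layer standing in for the full query-based vision transformer, and random prefix truncation as one possible reading of the dropout mechanism). It is not the authors' implementation.

import torch
import torch.nn as nn

class QueryEncoder(nn.Module):
    """Cross-attends learned latent queries to precomputed video patch features."""
    def __init__(self, dim=512, num_queries=256, heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, patch_feats):            # (B, N_patches, dim)
        b = patch_feats.shape[0]
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        latents, _ = self.attn(q, patch_feats, patch_feats)
        return self.norm(latents)              # (B, num_queries, dim): a 1D latent sequence

def variable_length_dropout(latents, min_keep=32):
    """Assumed reading: keep only a random prefix of the latent tokens during training,
    so the diffusion decoder learns to reconstruct from latents of varying length."""
    k = torch.randint(min_keep, latents.shape[1] + 1, (1,)).item()
    return latents[:, :k]
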
High-performance room-temperature sensing is often limited by non-stationary $1/f$ fluctuations and non-Gaussian stochasticity. In spintronic devices, thermally activated Néel switching creates heavy-tailed noise that masks weak signals, defeating linear filters optimized for Gaussian statistics. Here, we introduce a physics-integrated inference framework that decouples signal morphology from stochastic transients using a hierarchical 1D CNN-GRU topology. By learning the temporal signatures of Néel relaxation, this architecture reduces the Noise Equivalent Differential Temperature (NEDT) of spintronic Poisson bolometers by nearly a factor of six (from 233.78 mK to 40.44 mK), effectively elevating room-temperature sensitivity toward cryogenic limits. We demonstrate the framework's universality across the electromagnetic and biological spectrum, achieving a 9-fold error suppression in radar tracking, a 40\% uncertainty reduction in LiDAR, and a 15.56 dB SNR enhancement in ECG. This hardware-inference coupling recovers deterministic signals from fluctuation-dominated regimes, enabling near-ideal detection limits in noisy edge environments.
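
A minimal sketch of what a hierarchical 1D CNN-GRU denoiser could look like, with all layer sizes assumed (the published topology, losses, and physics integration are not specified in the abstract): convolutions capture local pulse morphology, the GRU models the slower relaxation dynamics, and a linear head emits the cleaned trace.

import torch
import torch.nn as nn

class CnnGruDenoiser(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        # Local morphology via stacked 1D convolutions
        self.cnn = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=7, padding=3), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
        )
        # Slower temporal dynamics via a recurrent stage
        self.gru = nn.GRU(64, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                   # x: (B, T) raw bolometer trace
        h = self.cnn(x.unsqueeze(1))         # (B, 64, T)
        h, _ = self.gru(h.transpose(1, 2))   # (B, T, hidden)
        return self.head(h).squeeze(-1)      # (B, T) denoised estimate
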
Auditory attention decoding (AAD) identifies the attended speech stream in multi-speaker environments by decoding brain signals such as electroencephalography (EEG). This technology is essential for realizing smart hearing aids that address the cocktail party problem and for facilitating objective audiometry systems. Existing AAD research mainly utilizes dichotic environments, where different speech signals are presented to the left and right ears, enabling models to classify directional attention rather than speech content. However, this spatial reliance limits applicability to real-world scenarios in which speakers overlap or move dynamically. To address this challenge, we propose an AAD framework for diotic environments, where identical speech mixtures are presented to both ears, eliminating spatial cues. Our approach maps EEG and speech signals into a shared latent space using independent encoders. We extract speech features using wav2vec 2.0 and encode them with a 2-layer 1D convolutional neural network (CNN), while employing the BrainNetwork architecture for EEG encoding. The model identifies the attended speech by computing the cosine similarity between EEG and speech representations. We evaluate our method on a diotic EEG dataset and achieve 72.70% accuracy, 22.58% higher than the state-of-the-art direction-based AAD method.
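
A minimal sketch of the matching step under assumed dimensions (the BrainNetwork EEG encoder is only stubbed as an input here): wav2vec 2.0 frame features are encoded by a 2-layer 1D CNN, and the attended stream is the candidate whose speech embedding has the highest cosine similarity to the EEG embedding.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechEncoder(nn.Module):
    def __init__(self, in_dim=768, out_dim=128):      # 768 = wav2vec 2.0 base feature width (assumed)
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_dim, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(256, out_dim, kernel_size=3, padding=1),
        )

    def forward(self, w2v_feats):                      # (B, T, 768) wav2vec 2.0 features
        h = self.net(w2v_feats.transpose(1, 2))        # (B, out_dim, T)
        return h.mean(dim=-1)                          # (B, out_dim) speech embedding

def decode_attention(eeg_emb, speech_embs):
    """eeg_emb: (B, D) from the EEG encoder; speech_embs: list of (B, D), one per speaker."""
    sims = torch.stack([F.cosine_similarity(eeg_emb, s, dim=-1) for s in speech_embs], dim=1)
    return sims.argmax(dim=1)                          # index of the attended speaker
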
Although deep learning has advanced automated electrocardiogram (ECG) diagnosis, prevalent supervised methods typically treat recordings as undifferentiated one-dimensional (1D) signals or two-dimensional (2D) images. This formulation compels models to learn physiological structures implicitly, resulting in data inefficiency and opacity that diverge from medical reasoning. To address these limitations, we propose BEAT-Net, a Biomimetic ECG Analysis with Tokenization framework that reformulates the problem as a language modeling task. A QRS tokenization strategy transforms continuous signals into biologically aligned heartbeat sequences, and specialized encoders explicitly decompose cardiac physiology by extracting local beat morphology, normalizing spatial lead perspectives, and modeling temporal rhythm dependencies. Evaluations across three large-scale benchmarks demonstrate that BEAT-Net matches the diagnostic accuracy of dominant convolutional neural network (CNN) architectures while substantially improving robustness. The framework exhibits exceptional data efficiency, recovering fully supervised performance with only 30 to 35 percent of the annotated data. Moreover, learned attention mechanisms provide inherent interpretability by spontaneously reproducing clinical heuristics, such as Lead II prioritization for rhythm analysis, without explicit supervision. These findings indicate that integrating biological priors offers a computationally efficient and interpretable alternative to data-intensive large-scale pre-training.
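
A rough illustration of the QRS tokenization idea (the peak detector and window length here are simplifying assumptions, not BEAT-Net's actual pipeline): detect R-peaks, cut a fixed window around each one, and treat the resulting beat-aligned segments as the tokens a sequence model consumes.

import numpy as np
from scipy.signal import find_peaks

def qrs_tokenize(ecg, fs=500, window_s=0.6):
    """ecg: (T,) single-lead signal; returns an (n_beats, window) array of beat tokens."""
    half = int(window_s * fs / 2)
    # Crude R-peak detector for illustration only: enforce a refractory distance
    # and a high-amplitude threshold.
    peaks, _ = find_peaks(ecg, distance=int(0.4 * fs),
                          height=np.percentile(ecg, 90))
    tokens = [ecg[p - half:p + half] for p in peaks
              if p - half >= 0 and p + half <= len(ecg)]
    return np.stack(tokens) if tokens else np.empty((0, 2 * half))
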
Genomic prediction of drug resistance in Mycobacterium tuberculosis remains challenging due to complex epistatic interactions and highly variable sequencing data quality. We present a novel Interpretable Variant-Aware Multi-Path Network (VAMP-Net) that addresses both challenges through complementary machine learning pathways. Path-1 employs a Set Attention Transformer that processes permutation-invariant variant sets to capture epistatic interactions between genomic loci. Path-2 utilizes a 1D Convolutional Neural Network that analyzes Variant Call Format (VCF) quality metrics to learn adaptive confidence scores. A fusion module combines both pathways for the final resistance classification. We conduct comparative evaluations of unmasked versus padding-masked Set Attention Blocks and demonstrate that our multi-path architecture outperforms baseline CNN and MLP models, with accuracy exceeding 95% and AUC around 97% for Rifampicin (RIF) and Rifabutin (RFB) resistance prediction. The framework provides dual-layer interpretability: attention-weight analysis reveals epistatic networks and Integrated Gradients (IG) highlights critical resistance loci (notably rpoB), while gradient-based feature importance from the CNN pathway uncovers drug-specific dependencies on data quality metrics. By delivering state-of-the-art predictive performance alongside auditable interpretability at two distinct levels, the genetic causality of mutation sets and the technical confidence of sequencing evidence, this architecture advances clinical genomics and establishes a new paradigm for robust, clinically actionable resistance prediction.
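
A schematic two-path sketch under placeholder dimensions (not the paper's configuration): variant embeddings pass through order-invariant self-attention, VCF quality metrics pass through a 1D CNN, and the pooled outputs are fused for a binary resistance logit.

import torch
import torch.nn as nn

class TwoPathResistanceNet(nn.Module):
    def __init__(self, var_dim=64, qual_channels=8):
        super().__init__()
        # Path 1: permutation-invariant set attention over variant embeddings
        self.set_attn = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=var_dim, nhead=4, batch_first=True),
            num_layers=2)
        # Path 2: 1D CNN over per-variant quality metric tracks
        self.qual_cnn = nn.Sequential(
            nn.Conv1d(qual_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1))
        self.fusion = nn.Sequential(nn.Linear(var_dim + 32, 64), nn.ReLU(),
                                    nn.Linear(64, 1))

    def forward(self, variant_set, qual_metrics, pad_mask=None):
        # variant_set: (B, N_variants, var_dim); qual_metrics: (B, qual_channels, L)
        h_var = self.set_attn(variant_set, src_key_padding_mask=pad_mask).mean(dim=1)
        h_qual = self.qual_cnn(qual_metrics).squeeze(-1)
        return self.fusion(torch.cat([h_var, h_qual], dim=-1))   # resistance logit
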
A fundamental limitation of supervised deep learning in high-dimensional tabular domains is "Generalization Collapse": models learn precise decision boundaries for known distributions but fail catastrophically when facing Out-of-Distribution (OOD) data. We hypothesize that this failure stems from the lack of topological constraints in the latent space, resulting in diffuse manifolds where novel anomalies remain statistically indistinguishable from benign data. To address this, we propose Latent Sculpting, a hierarchical two-stage representation learning framework. Stage 1 utilizes a hybrid 1D-CNN and Transformer Encoder trained with a novel Dual-Centroid Compactness Loss (DCCL) to actively "sculpt" benign traffic into a low-entropy, hyperspherical cluster. Unlike standard contrastive losses that rely on triplet mining, DCCL optimizes global cluster centroids to enforce absolute manifold density. Stage 2 conditions a Masked Autoregressive Flow (MAF) on this pre-structured manifold to learn an exact density estimate. We evaluate this methodology on the rigorous CIC-IDS-2017 benchmark, treating it as a proxy for complex, non-stationary data streams. Empirical results demonstrate that explicit manifold sculpting is a prerequisite for robust zero-shot generalization. While supervised baselines suffered catastrophic performance collapse on unseen distribution shifts (F1 approx 0.30) and the strongest unsupervised baseline achieved only 0.76, our framework achieved an F1-Score of 0.87 on strictly zero-shot anomalies. Notably, we report an 88.89% detection rate on "Infiltration" scenarios--a complex distributional shift where state-of-the-art supervised models achieved 0.00% accuracy. These findings suggest that decoupling structure learning from density estimation provides a scalable path toward generalized anomaly detection.
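
The abstract does not give the exact form of the Dual-Centroid Compactness Loss; one plausible reading, shown purely for illustration, uses two learnable centroids, a compactness term pulling each embedding toward its class centroid, and a separation term pushing the two centroids apart.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DualCentroidCompactnessLoss(nn.Module):
    def __init__(self, dim=128, margin=4.0):
        super().__init__()
        self.centroids = nn.Parameter(torch.randn(2, dim))   # assumed: [benign, attack]
        self.margin = margin

    def forward(self, z, labels):
        # z: (B, dim) embeddings from the 1D-CNN/Transformer encoder; labels: (B,) in {0, 1}
        compact = ((z - self.centroids[labels]) ** 2).sum(dim=-1).mean()     # pull to own centroid
        sep = F.relu(self.margin - (self.centroids[0] - self.centroids[1]).norm())  # keep centroids apart
        return compact + sep
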
Accurate estimation of subsurface material properties, such as soil moisture, is critical for wildfire risk assessment and precision agriculture. Ground-penetrating radar (GPR) is a non-destructive geophysical technique widely used to characterize subsurface conditions. Data-driven parameter estimation methods typically require large amounts of labeled training data, which is expensive to obtain from real-world GPR scans under diverse subsurface conditions. A physics-based GPR model using the finite-difference time-domain (FDTD) method can generate large synthetic datasets through simulations across varying material parameters, which are then used to train data-driven models. A key limitation, however, is that simulated data (source domain) and real-world data (target domain) often follow different distributions, which can cause data-driven models trained on simulations to underperform in real-world scenarios. To address this challenge, this study proposes a novel physics-guided hierarchical domain adaptation framework with deep adversarial learning for robust subsurface material property estimation from GPR signals. The proposed framework is systematically evaluated through laboratory and field tests on single- and two-layer materials and is benchmarked against state-of-the-art methods, including a one-dimensional convolutional neural network (1D CNN) and a domain adversarial neural network (DANN). The results demonstrate that the proposed framework achieves higher correlation coefficients (R) and lower bias between predicted and measured parameter values, along with smaller standard deviations in the estimates, thereby validating its effectiveness in bridging the domain gap between simulated and real-world radar signals and enabling efficient subsurface material property retrieval.
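
For context on the adversarial component, the sketch below shows the gradient-reversal trick that underlies DANN-style adversarial domain adaptation; it is generic background rather than the proposed hierarchical, physics-guided framework.

import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Identity on the forward pass, sign-flipped (scaled) gradient on the backward
        # pass, so the shared feature extractor is pushed toward domain-invariant features
        # while a separate domain classifier tries to tell simulated from real signals.
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)
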

Forecasting meteorological variables is challenging due to the complexity of the underlying processes, requiring advanced models for accurate results. Accurate precipitation forecasts are vital for society, and reliable predictions help communities mitigate climatic impacts. Given the current relevance of artificial intelligence (AI), classical machine learning (ML) and deep learning (DL) techniques have been used as an alternative or complement to dynamic modeling. However, broad investigations into the feasibility of purely data-driven approaches for precipitation forecasting are still lacking. This study addresses this gap through a detailed investigation of different classical ML and DL approaches for forecasting precipitation in South America across all 2019 seasons. The selected classical ML techniques were Random Forests and extreme gradient boosting (XGBoost), while the DL counterparts were a 1D convolutional neural network (CNN 1D), a long short-term memory (LSTM) model, and a gated recurrent unit (GRU) model. Additionally, the Brazilian Global Atmospheric Model (BAM) was used as a representative of the traditional dynamic modeling approach. We also relied on explainable artificial intelligence (XAI) to explain the models' behaviors. LSTM showed strong predictive performance, while BAM, the traditional dynamic model representative, had the worst results. Despite having the highest latency, LSTM was the most accurate for heavy precipitation; if cost is a concern, XGBoost offers lower latency with only a slight loss in accuracy. The results of this research confirm the viability of DL models for climate forecasting, reinforcing a global trend in major meteorological and climate forecasting centers.
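
For concreteness, a minimal shape of the recurrent forecaster is sketched below; the hyperparameters and input variables are assumptions rather than the study's configuration. The sketch maps a window of past meteorological variables to the next-step precipitation value.

import torch
import torch.nn as nn

class LstmForecaster(nn.Module):
    def __init__(self, n_features=8, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):             # x: (B, window, n_features) past meteorological variables
        h, _ = self.lstm(x)
        return self.head(h[:, -1])    # (B, 1) next-step precipitation estimate
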

Speech Emotion Recognition (SER) systems often degrade in performance when exposed to the unpredictable acoustic interference found in real-world environments. Additionally, the opacity of deep learning models hinders their adoption in trust-sensitive applications. To bridge this gap, we propose a Hybrid Transformer-CNN framework that unifies the contextual modeling of Wav2Vec 2.0 with the spectral stability of 1D-Convolutional Neural Networks. Our dual-stream architecture processes raw waveforms to capture long-range temporal dependencies while simultaneously extracting noise-resistant spectral features (MFCC, ZCR, RMSE) via a custom Attentive Temporal Pooling mechanism. We conducted extensive validation across four diverse benchmark datasets: RAVDESS, TESS, SAVEE, and CREMA-D. To rigorously test robustness, we subjected the model to non-stationary acoustic interference using real-world noise profiles from the SAS-KIIT dataset. The proposed framework demonstrates superior generalization and state-of-the-art accuracy across all datasets, significantly outperforming single-branch baselines under realistic environmental interference. Furthermore, we address the ``black-box'' problem by integrating SHAP and Score-CAM into the evaluation pipeline. These tools provide granular visual explanations, revealing how the model strategically shifts attention between temporal and spectral cues to maintain reliability in the presence of complex environmental noise.
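
A minimal sketch of an attentive temporal pooling layer (the paper's custom mechanism may differ): frame-level spectral features such as stacked MFCC, ZCR, and RMSE are collapsed into a single utterance vector by a learned softmax weighting over time.

import torch
import torch.nn as nn

class AttentiveTemporalPooling(nn.Module):
    def __init__(self, feat_dim):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)   # one attention score per frame

    def forward(self, frames):                         # (B, T, feat_dim) frame-level features
        w = torch.softmax(self.score(frames), dim=1)   # (B, T, 1) attention weights over time
        return (w * frames).sum(dim=1)                 # (B, feat_dim) utterance embedding
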

Predicting the binding affinity between antigens and antibodies is fundamental to drug discovery and vaccine development. Traditional computational approaches often rely on experimentally determined 3D structures, which are scarce and computationally expensive to obtain. This paper introduces DuaDeep-SeqAffinity, a novel sequence-only deep learning framework that predicts affinity scores directly from antigen and antibody amino acid sequences using a dual-stream hybrid architecture. Our approach leverages pre-trained ESM-2 protein language model embeddings, combining 1D Convolutional Neural Networks (CNNs) for local motif detection with Transformer encoders for global contextual representation. A subsequent fusion module integrates these multi-faceted features, which are then passed to a fully connected network for final score regression. Experimental results demonstrate that DuaDeep-SeqAffinity significantly outperforms individual architectural components and existing state-of-the-art (SOTA) methods. DuaDeep achieved a superior Pearson correlation of 0.688, an R^2 of 0.460, and a Root Mean Square Error (RMSE) of 0.737, surpassing the single-branch variants ESM-CNN and ESM-Transformer. Notably, the model achieved an Area Under the Curve (AUC) of 0.890, outperforming sequence-only benchmarks and even surpassing structure-sequence hybrid models. These findings demonstrate that high-fidelity sequence embeddings can capture essential binding patterns typically reserved for structural modeling. By eliminating the reliance on 3D structures, DuaDeep-SeqAffinity provides a highly scalable and efficient solution for high-throughput screening of vast sequence libraries, significantly accelerating the therapeutic discovery pipeline.
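
A schematic of the dual-stream design under assumed sizes (the ESM-2 embedding width, channel counts, and concatenation-based fusion are placeholders): a 1D CNN branch for local motifs, a Transformer branch for global context, and an MLP regressor on the fused representation.

import torch
import torch.nn as nn

class DualStreamAffinity(nn.Module):
    def __init__(self, esm_dim=1280, d_model=256):
        super().__init__()
        # Local-motif branch: 1D convolution over the residue axis
        self.cnn = nn.Sequential(
            nn.Conv1d(esm_dim, d_model, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1))
        # Global-context branch: Transformer encoder over projected embeddings
        self.proj = nn.Linear(esm_dim, d_model)
        self.transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True),
            num_layers=2)
        self.regressor = nn.Sequential(nn.Linear(2 * d_model, 128), nn.ReLU(),
                                       nn.Linear(128, 1))

    def forward(self, esm_embeddings):                 # (B, L, esm_dim) paired-sequence embeddings
        local = self.cnn(esm_embeddings.transpose(1, 2)).squeeze(-1)       # (B, d_model)
        global_ctx = self.transformer(self.proj(esm_embeddings)).mean(dim=1)  # (B, d_model)
        return self.regressor(torch.cat([local, global_ctx], dim=-1))      # (B, 1) affinity score
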