Abstract:This article proposes a robust brain-inspired audio feature extractor (RBA-FE) model for depression diagnosis, using an improved hierarchical network architecture. Most deep learning models achieve state-of-the-art performance for image-based diagnostic tasks, ignoring the counterpart audio features. In order to tailor the noise challenge, RBA-FE leverages six acoustic features extracted from the raw audio, capturing both spatial characteristics and temporal dependencies. This hybrid attribute helps alleviate the precision limitation in audio feature extraction within other learning models like deep residual shrinkage networks. To deal with the noise issues, our model incorporates an improved spiking neuron model, called adaptive rate smooth leaky integrate-and-fire (ARSLIF). The ARSLIF model emulates the mechanism of ``retuning of cellular signal selectivity" in the brain attention systems, which enhances the model robustness against environmental noises in audio data. Experimental results demonstrate that RBA-FE achieves state-of-the-art accuracy on the MODMA dataset, respectively with 0.8750, 0.8974, 0.8750 and 0.8750 in precision, accuracy, recall and F1 score. Extensive experiments on the AVEC2014 and DAIC-WOZ datasets both show enhancements in noise robustness. It is further indicated by comparison that the ARSLIF neuron model suggest the abnormal firing pattern within the feature extraction on depressive audio data, offering brain-inspired interpretability.
Abstract:The rapid advancement of artificial intelligence (AI) has enabled sophisticated audio generation and voice cloning technologies, posing significant security risks for applications reliant on voice authentication. While existing datasets and models primarily focus on distinguishing between human and fully synthetic speech, real-world attacks often involve audio that combines both genuine and cloned segments. To address this gap, we construct a novel hybrid audio dataset incorporating human, AI-generated, cloned, and mixed audio samples. We further propose fine-tuned Audio Spectrogram Transformer (AST)-based models tailored for detecting these complex acoustic patterns. Extensive experiments demonstrate that our approach significantly outperforms existing baselines in mixed-audio detection, achieving 97\% classification accuracy. Our findings highlight the importance of hybrid datasets and tailored models in advancing the robustness of speech-based authentication systems.
Abstract:This article proposes a brain-inspired generative (BIG) model that merges an impulsive-attention neural network and a variational autoencoder (VAE) for identifying cognitive states based on electroencephalography (EEG) data. A hybrid learning method is presented for training the model by integrating gradient-based learning and heteroassociative memory. The BIG model is capable of achieving multi-task objectives: classification, generating new EEG, and brain network interpretation, alleviating the limitations of excessive data training and high computational cost in conventional approaches. Experimental results on two public EEG datasets demonstrate that the BIG model achieves a classification accuracy above 89\%, comparable with state-of-the-art methods, while reducing computational cost by nearly 11\% over the baseline EEGNet. Incorporating the generated EEG data for training, the BIG model exhibits comparative performance in a few-shot pattern.Ablation studies justify the poised brain-inspired characteristic regarding the impulsive-attention module and the hybrid learning method. Thanks to the performance advantages with interpretable outputs, this BIG model has application potential for building digital twins of the brain.
Abstract:Contraction metrics are crucial in control theory because they provide a powerful framework for analyzing stability, robustness, and convergence of various dynamical systems. However, identifying these metrics for complex nonlinear systems remains an open challenge due to the lack of scalable and effective tools. This paper explores the approach of learning verifiable contraction metrics parametrized as neural networks (NNs) for discrete-time nonlinear dynamical systems. While prior works on formal verification of contraction metrics for general nonlinear systems have focused on convex optimization methods (e.g. linear matrix inequalities, etc) under the assumption of continuously differentiable dynamics, the growing prevalence of NN-based controllers, often utilizing ReLU activations, introduces challenges due to the non-smooth nature of the resulting closed-loop dynamics. To bridge this gap, we establish a new sufficient condition for establishing formal neural contraction metrics for general discrete-time nonlinear systems assuming only the continuity of the dynamics. We show that from a computational perspective, our sufficient condition can be efficiently verified using the state-of-the-art neural network verifier $\alpha,\!\beta$-CROWN, which scales up non-convex neural network verification via novel integration of symbolic linear bound propagation and branch-and-bound. Built upon our analysis tool, we further develop a learning method for synthesizing neural contraction metrics from sampled data. Finally, our approach is validated through the successful synthesis and verification of NN contraction metrics for various nonlinear examples.
Abstract:We introduce Adjoint Sampling, a highly scalable and efficient algorithm for learning diffusion processes that sample from unnormalized densities, or energy functions. It is the first on-policy approach that allows significantly more gradient updates than the number of energy evaluations and model samples, allowing us to scale to much larger problem settings than previously explored by similar methods. Our framework is theoretically grounded in stochastic optimal control and shares the same theoretical guarantees as Adjoint Matching, being able to train without the need for corrective measures that push samples towards the target distribution. We show how to incorporate key symmetries, as well as periodic boundary conditions, for modeling molecules in both cartesian and torsional coordinates. We demonstrate the effectiveness of our approach through extensive experiments on classical energy functions, and further scale up to neural network-based energy models where we perform amortized conformer generation across many molecular systems. To encourage further research in developing highly scalable sampling methods, we plan to open source these challenging benchmarks, where successful methods can directly impact progress in computational chemistry.
Abstract:Emergent cognitive abilities in large language models (LLMs) have been widely observed, but their nature and underlying mechanisms remain poorly understood. A growing body of research draws on cognitive science to investigate LLM cognition, but standard methodologies and experimen-tal pipelines have not yet been established. To address this gap we develop CognitivEval, a framework for systematically evaluating the artificial cognitive capabilities of LLMs, with a particular emphasis on robustness in response collection. The key features of CognitivEval include: (i) automatic prompt permutations, and (ii) testing that gathers both generations and model probability estimates. Our experiments demonstrate that these features lead to more robust experimental outcomes. Using CognitivEval, we replicate five classic experiments in cognitive science, illustrating the framework's generalizability across various experimental tasks and obtaining a cognitive profile of several state of the art LLMs. CognitivEval will be released publicly to foster broader collaboration within the cognitive science community.
Abstract:Neural Architecture Search (NAS) for deep learning object detection frameworks typically involves multiple modules, each performing distinct tasks. These modules contribute to a vast search space, resulting in searches that can take several GPU hours or even days, depending on the complexity of the search space. This makes joint optimization both challenging and computationally expensive. Furthermore, satisfying target device constraints across modules adds additional complexity to the optimization process. To address these challenges, we propose \textbf{FACETS}, e\textbf{\underline{F}}ficient Once-for-\textbf{\underline{A}}ll Object Detection via \textbf{\underline{C}}onstrained it\textbf{\underline{E}}ra\textbf{\underline{T}}ive\textbf{\underline{S}}earch, a novel unified iterative NAS method that refines the architecture of all modules in a cyclical manner. FACETS leverages feedback from previous iterations, alternating between fixing one module's architecture and optimizing the others. This approach reduces the overall search space while preserving interdependencies among modules and incorporates constraints based on the target device's computational budget. In a controlled comparison against progressive and single-module search strategies, FACETS achieves architectures with up to $4.75\%$ higher accuracy twice as fast as progressive search strategies in earlier stages, while still being able to achieve a global optimum. Moreover, FACETS demonstrates the ability to iteratively refine the search space, producing better performing architectures over time. The refined search space yields candidates with a mean accuracy up to $27\%$ higher than global search and $5\%$ higher than progressive search methods via random sampling.
Abstract:Computing tight Lipschitz bounds for deep neural networks is crucial for analyzing their robustness and stability, but existing approaches either produce relatively conservative estimates or rely on semidefinite programming (SDP) formulations (namely the LipSDP condition) that face scalability issues. Building upon ECLipsE-Fast, the state-of-the-art Lipschitz bound method that avoids SDP formulations, we derive a new family of improved scalable Lipschitz bounds that can be combined to outperform ECLipsE-Fast. Specifically, we leverage more general parameterizations of feasible points of LipSDP to derive various closed-form Lipschitz bounds, avoiding the use of SDP solvers. In addition, we show that our technique encompasses ECLipsE-Fast as a special case and leads to a much larger class of scalable Lipschitz bounds for deep neural networks. Our empirical study shows that our bounds improve ECLipsE-Fast, further advancing the scalability and precision of Lipschitz estimation for large neural networks.
Abstract:The image translation method represents a crucial approach for mitigating information deficiencies in the infrared and visible modalities, while also facilitating the enhancement of modality-specific datasets. However, existing methods for infrared and visible image translation either achieve unidirectional modality translation or rely on cycle consistency for bidirectional modality translation, which may result in suboptimal performance. In this work, we present the cross-modality translation diffusion model (CM-Diff) for simultaneously modeling data distributions in both the infrared and visible modalities. We address this challenge by combining translation direction labels for guidance during training with cross-modality feature control. Specifically, we view the establishment of the mapping relationship between the two modalities as the process of learning data distributions and understanding modality differences, achieved through a novel Bidirectional Diffusion Training (BDT) strategy. Additionally, we propose a Statistical Constraint Inference (SCI) strategy to ensure the generated image closely adheres to the data distribution of the target modality. Experimental results demonstrate the superiority of our CM-Diff over state-of-the-art methods, highlighting its potential for generating dual-modality datasets.
Abstract:Cross-subject variability in EEG degrades performance of current deep learning models, limiting the development of brain-computer interface (BCI). This paper proposes ISAM-MTL, which is a multi-task learning (MTL) EEG classification model based on identifiable spiking (IS) representations and associative memory (AM) networks. The proposed model treats EEG classification of each subject as an independent task and leverages cross-subject data training to facilitate feature sharing across subjects. ISAM-MTL consists of a spiking feature extractor that captures shared features across subjects and a subject-specific bidirectional associative memory network that is trained by Hebbian learning for efficient and fast within-subject EEG classification. ISAM-MTL integrates learned spiking neural representations with bidirectional associative memory for cross-subject EEG classification. The model employs label-guided variational inference to construct identifiable spike representations, enhancing classification accuracy. Experimental results on two BCI Competition datasets demonstrate that ISAM-MTL improves the average accuracy of cross-subject EEG classification while reducing performance variability among subjects. The model further exhibits the characteristics of few-shot learning and identifiable neural activity beneath EEG, enabling rapid and interpretable calibration for BCI systems.