The increasing success of deep neural networks has raised concerns about their inherent black-box nature, posing challenges related to interpretability and trust. While there has been extensive exploration of interpretation techniques in vision and language, interpretability in the audio domain has received limited attention, primarily focusing on post-hoc explanations. This paper addresses the problem of interpretability by-design in the audio domain by utilizing the recently proposed attention-free focal modulation networks (FocalNets). We apply FocalNets to the task of environmental sound classification for the first time and evaluate their interpretability properties on the popular ESC-50 dataset. Our method outperforms a similarly sized vision transformer both in terms of accuracy and interpretability. Furthermore, it is competitive against PIQ, a method specifically designed for post-hoc interpretation in the audio domain.
We propose a novel framework that leverages LLMs for full causal graph discovery. While previous LLM-based methods have used a pairwise query approach, this requires a quadratic number of queries which quickly becomes impractical for larger causal graphs. In contrast, the proposed framework uses a breadth-first search (BFS) approach which allows it to use only a linear number of queries. We also show that the proposed method can easily incorporate observational data when available, to improve performance. In addition to being more time and data-efficient, the proposed framework achieves state-of-the-art results on real-world causal graphs of varying sizes. The results demonstrate the effectiveness and efficiency of the proposed method in discovering causal relationships, showcasing its potential for broad applicability in causal graph discovery tasks across different domains.
Age of incorrect information (AoII) has recently been proposed as an alternative to existing information freshness metrics for real-time sampling and estimation problems involving information sources that are tracked by remote monitors. Different from existing metrics, AoII penalizes the incorrect information by increasing linearly with time as long as the source and the monitor are de-synchronized, and is reset when they are synchronized back. While AoII has generally been investigated for discrete time information sources, we develop a novel analytical model in this paper for push- and pull-based sampling and transmission of a continuous time Markov chain (CTMC) process. In the pull-based model, the sensor starts transmitting information on the observed CTMC only when a pull request from the monitor is received. On the other hand, in the push-based scenario, the sensor, being aware of the AoII process, samples and transmits when the AoII process exceeds a random threshold. The proposed analytical model for both scenarios is based on the construction of a discrete time MC (DTMC) making state transitions at the embedded epochs of synchronization points, using the theory of absorbing CTMCs, and in particular phase-type distributions. For a given sampling policy, analytical models to obtain the mean AoII and the average sampling rate are developed. Numerical results are presented to validate the analytical model as well as to provide insight on optimal sampling policies under sampling rate constraints.
A demonstration of a real-time and continuous turn-taking prediction system is presented. The system is based on a voice activity projection (VAP) model, which directly maps dialogue stereo audio to future voice activities. The VAP model includes contrastive predictive coding (CPC) and self-attention transformers, followed by a cross-attention transformer. We examine the effect of the input context audio length and demonstrate that the proposed system can operate in real-time with CPU settings, with minimal performance degradation.
An algorithm to estimate the evolution of learning curves on the whole of a training data base, based on the results obtained from a portion and using a functional strategy, is introduced. We approximate iteratively the sought value at the desired time, independently of the learning technique used and once a point in the process, called prediction level, has been passed. The proposal proves to be formally correct with respect to our working hypotheses and includes a reliable proximity condition. This allows the user to fix a convergence threshold with respect to the accuracy finally achievable, which extends the concept of stopping criterion and seems to be effective even in the presence of distorting observations. Our aim is to evaluate the training effort, supporting decision making in order to reduce the need for both human and computational resources during the learning process. The proposal is of interest in at least three operational procedures. The first is the anticipation of accuracy gain, with the purpose of measuring how much work is needed to achieve a certain degree of performance. The second relates the comparison of efficiency between systems at training time, with the objective of completing this task only for the one that best suits our requirements. The prediction of accuracy is also a valuable item of information for customizing systems, since we can estimate in advance the impact of settings on both the performance and the development costs. Using the generation of part-of-speech taggers as an example application, the experimental results are consistent with our expectations.
In general, underwater images suffer from color distortion and low contrast, because light is attenuated and backscattered as it propagates through water (differently depending on wavelength and on the properties of the water body). An existing simple degradation model (similar to atmospheric image "hazing" effects), though helpful, is not sufficient to properly represent the underwater image degradation because there are unaccounted for and non-measurable factors e.g. scattering of light due to turbidity of water, reflective characteristics of turbid medium etc. We propose a deep learning-based architecture to automatically simulate the underwater effects where only a dehazing-like image formation equation is known to the network, and the additional degradation due to the other unknown factors if inferred in a data-driven way. We only use RGB images (because in real-time scenario depth image is not available) to estimate the depth image. For testing, we have proposed (due to the lack of real underwater image datasets) a complex image formation model/equation to manually generate images that resemble real underwater images (used as ground truth). However, only the classical image formation equation (the one used for image dehazing) is informed to the network. This mimics the fact that in a real scenario, the physics are never completely known and only simplified models are known. Thanks to the ground truth, generated by a complex image formation equation, we could successfully perform a qualitative and quantitative evaluation of proposed technique, compared to other purely data driven approaches
Video sequences exhibit significant nuisance variations (undesired effects) of speed of actions, temporal locations, and subjects' poses, leading to temporal-viewpoint misalignment when comparing two sets of frames or evaluating the similarity of two sequences. Thus, we propose Joint tEmporal and cAmera viewpoiNt alIgnmEnt (JEANIE) for sequence pairs. In particular, we focus on 3D skeleton sequences whose camera and subjects' poses can be easily manipulated in 3D. We evaluate JEANIE on skeletal Few-shot Action Recognition (FSAR), where matching well temporal blocks (temporal chunks that make up a sequence) of support-query sequence pairs (by factoring out nuisance variations) is essential due to limited samples of novel classes. Given a query sequence, we create its several views by simulating several camera locations. For a support sequence, we match it with view-simulated query sequences, as in the popular Dynamic Time Warping (DTW). Specifically, each support temporal block can be matched to the query temporal block with the same or adjacent (next) temporal index, and adjacent camera views to achieve joint local temporal-viewpoint warping. JEANIE selects the smallest distance among matching paths with different temporal-viewpoint warping patterns, an advantage over DTW which only performs temporal alignment. We also propose an unsupervised FSAR akin to clustering of sequences with JEANIE as a distance measure. JEANIE achieves state-of-the-art results on NTU-60, NTU-120, Kinetics-skeleton and UWA3D Multiview Activity II on supervised and unsupervised FSAR, and their meta-learning inspired fusion.
In this study, we introduce BirdNeRF, an adaptation of Neural Radiance Fields (NeRF) designed specifically for reconstructing large-scale scenes using aerial imagery. Unlike previous research focused on small-scale and object-centric NeRF reconstruction, our approach addresses multiple challenges, including (1) Addressing the issue of slow training and rendering associated with large models. (2) Meeting the computational demands necessitated by modeling a substantial number of images, requiring extensive resources such as high-performance GPUs. (3) Overcoming significant artifacts and low visual fidelity commonly observed in large-scale reconstruction tasks due to limited model capacity. Specifically, we present a novel bird-view pose-based spatial decomposition algorithm that decomposes a large aerial image set into multiple small sets with appropriately sized overlaps, allowing us to train individual NeRFs of sub-scene. This decomposition approach not only decouples rendering time from the scene size but also enables rendering to scale seamlessly to arbitrarily large environments. Moreover, it allows for per-block updates of the environment, enhancing the flexibility and adaptability of the reconstruction process. Additionally, we propose a projection-guided novel view re-rendering strategy, which aids in effectively utilizing the independently trained sub-scenes to generate superior rendering results. We evaluate our approach on existing datasets as well as against our own drone footage, improving reconstruction speed by 10x over classical photogrammetry software and 50x over state-of-the-art large-scale NeRF solution, on a single GPU with similar rendering quality.
Topology optimization is a critical task in engineering design, where the goal is to optimally distribute material in a given space for maximum performance. We introduce Neural Implicit Topology Optimization (NITO), a novel approach to accelerate topology optimization problems using deep learning. NITO stands out as one of the first frameworks to offer a resolution-free and domain-agnostic solution in deep learning-based topology optimization. NITO synthesizes structures with up to seven times better structural efficiency compared to SOTA diffusion models and does so in a tenth of the time. In the NITO framework, we introduce a novel method, the Boundary Point Order-Invariant MLP (BPOM), to represent boundary conditions in a sparse and domain-agnostic manner, moving away from expensive simulation-based approaches. Crucially, NITO circumvents the domain and resolution limitations that restrict Convolutional Neural Network (CNN) models to a structured domain of fixed size -- limitations that hinder the widespread adoption of CNNs in engineering applications. This generalizability allows a single NITO model to train and generate solutions in countless domains, eliminating the need for numerous domain-specific CNNs and their extensive datasets. Despite its generalizability, NITO outperforms SOTA models even in specialized tasks, is an order of magnitude smaller, and is practically trainable at high resolutions that would be restrictive for CNNs. This combination of versatility, efficiency, and performance underlines NITO's potential to transform the landscape of engineering design optimization problems through implicit fields.
Rehabilitation tasks demand robust and accurate trajectory-tracking performance, mainly achieved with parallel robots. In this field, limiting the value of the force exerted on the patient is crucial, especially when an injured limb is involved. In human-robot interaction studies, the admittance controller modifies the location of the robot according to the user efforts driving the end-effector to an arbitrary location within the workspace. However, a parallel robot has singularities within the workspace, making implementing a conventional admittance controller unsafe. Thus, this study proposes an admittance controller that overcomes the limitations of singular configurations by using a real-time singularity avoidance algorithm. The singularity avoidance algorithm modifies the original trajectory based on the actual location of the parallel robot. The complemented admittance controller is applied to a 4 degrees of freedom parallel robot for knee rehabilitation. In this case, the actual location is measured by a 3D tracking system because the location calculated by the forward kinematics is inaccurate in the vicinity of a singularity. The experimental results verify the effectiveness of the proposed admittance controller for safe knee rehabilitation exercises