Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

"Time": models, code, and papers

Focal Modulation Networks for Interpretable Sound Classification

Feb 05, 2024
Luca Della Libera, Cem Subakan, Mirco Ravanelli

The increasing success of deep neural networks has raised concerns about their inherent black-box nature, posing challenges related to interpretability and trust. While there has been extensive exploration of interpretation techniques in vision and language, interpretability in the audio domain has received limited attention, primarily focusing on post-hoc explanations. This paper addresses the problem of interpretability by-design in the audio domain by utilizing the recently proposed attention-free focal modulation networks (FocalNets). We apply FocalNets to the task of environmental sound classification for the first time and evaluate their interpretability properties on the popular ESC-50 dataset. Our method outperforms a similarly sized vision transformer both in terms of accuracy and interpretability. Furthermore, it is competitive against PIQ, a method specifically designed for post-hoc interpretation in the audio domain.

* Accepted to ICASSP 2024 XAI-SA Workshop

Via

Access Paper or Ask Questions

Efficient Causal Graph Discovery Using Large Language Models

Feb 05, 2024
Thomas Jiralerspong, Xiaoyin Chen, Yash More, Vedant Shah, Yoshua Bengio

We propose a novel framework that leverages LLMs for full causal graph discovery. While previous LLM-based methods have used a pairwise query approach, this requires a quadratic number of queries which quickly becomes impractical for larger causal graphs. In contrast, the proposed framework uses a breadth-first search (BFS) approach which allows it to use only a linear number of queries. We also show that the proposed method can easily incorporate observational data when available, to improve performance. In addition to being more time and data-efficient, the proposed framework achieves state-of-the-art results on real-world causal graphs of varying sizes. The results demonstrate the effectiveness and efficiency of the proposed method in discovering causal relationships, showcasing its potential for broad applicability in causal graph discovery tasks across different domains.

Via

Access Paper or Ask Questions

Modeling AoII in Push- and Pull-Based Sampling of Continuous Time Markov Chains

Jan 11, 2024
Ismail Cosandal, Nail Akar, Sennur Ulukus

Age of incorrect information (AoII) has recently been proposed as an alternative to existing information freshness metrics for real-time sampling and estimation problems involving information sources that are tracked by remote monitors. Different from existing metrics, AoII penalizes the incorrect information by increasing linearly with time as long as the source and the monitor are de-synchronized, and is reset when they are synchronized back. While AoII has generally been investigated for discrete time information sources, we develop a novel analytical model in this paper for push- and pull-based sampling and transmission of a continuous time Markov chain (CTMC) process. In the pull-based model, the sensor starts transmitting information on the observed CTMC only when a pull request from the monitor is received. On the other hand, in the push-based scenario, the sensor, being aware of the AoII process, samples and transmits when the AoII process exceeds a random threshold. The proposed analytical model for both scenarios is based on the construction of a discrete time MC (DTMC) making state transitions at the embedded epochs of synchronization points, using the theory of absorbing CTMCs, and in particular phase-type distributions. For a given sampling policy, analytical models to obtain the mean AoII and the average sampling rate are developed. Numerical results are presented to validate the analytical model as well as to provide insight on optimal sampling policies under sampling rate constraints.

Via

Access Paper or Ask Questions

Real-time and Continuous Turn-taking Prediction Using Voice Activity Projection

Jan 10, 2024
Koji Inoue, Bing'er Jiang, Erik Ekstedt, Tatsuya Kawahara, Gabriel Skantze

A demonstration of a real-time and continuous turn-taking prediction system is presented. The system is based on a voice activity projection (VAP) model, which directly maps dialogue stereo audio to future voice activities. The VAP model includes contrastive predictive coding (CPC) and self-attention transformers, followed by a cross-attention transformer. We examine the effect of the input context audio length and demonstrate that the proposed system can operate in real-time with CPU settings, with minimal performance degradation.

* This paper has been accepted for presentation at International Workshop on Spoken Dialogue Systems Technology 2024 (IWSDS 2024) and represents the author's version of the work

Via

Access Paper or Ask Questions

Modeling of learning curves with applications to pos tagging

Feb 04, 2024
Manuel Vilares Ferro, Victor M. Darriba Bilbao, Francisco J. Ribadas Pena

An algorithm to estimate the evolution of learning curves on the whole of a training data base, based on the results obtained from a portion and using a functional strategy, is introduced. We approximate iteratively the sought value at the desired time, independently of the learning technique used and once a point in the process, called prediction level, has been passed. The proposal proves to be formally correct with respect to our working hypotheses and includes a reliable proximity condition. This allows the user to fix a convergence threshold with respect to the accuracy finally achievable, which extends the concept of stopping criterion and seems to be effective even in the presence of distorting observations. Our aim is to evaluate the training effort, supporting decision making in order to reduce the need for both human and computational resources during the learning process. The proposal is of interest in at least three operational procedures. The first is the anticipation of accuracy gain, with the purpose of measuring how much work is needed to achieve a certain degree of performance. The second relates the comparison of efficiency between systems at training time, with the objective of completing this task only for the one that best suits our requirements. The prediction of accuracy is also a valuable item of information for customizing systems, since we can estimate in advance the impact of settings on both the performance and the development costs. Using the generation of part-of-speech taggers as an example application, the experimental results are consistent with our expectations.

* Computer Speech & Language, 41, pp 1-28 (2017). ISSN 0885-2308. Elsevier
* 30 pages, 11 figures

Via

Access Paper or Ask Questions

Physics Informed and Data Driven Simulation of Underwater Images via Residual Learning

Feb 07, 2024
Tanmoy Mondal, Ricardo Mendoza, Lucas Drumetz

In general, underwater images suffer from color distortion and low contrast, because light is attenuated and backscattered as it propagates through water (differently depending on wavelength and on the properties of the water body). An existing simple degradation model (similar to atmospheric image "hazing" effects), though helpful, is not sufficient to properly represent the underwater image degradation because there are unaccounted for and non-measurable factors e.g. scattering of light due to turbidity of water, reflective characteristics of turbid medium etc. We propose a deep learning-based architecture to automatically simulate the underwater effects where only a dehazing-like image formation equation is known to the network, and the additional degradation due to the other unknown factors if inferred in a data-driven way. We only use RGB images (because in real-time scenario depth image is not available) to estimate the depth image. For testing, we have proposed (due to the lack of real underwater image datasets) a complex image formation model/equation to manually generate images that resemble real underwater images (used as ground truth). However, only the classical image formation equation (the one used for image dehazing) is informed to the network. This mimics the fact that in a real scenario, the physics are never completely known and only simplified models are known. Thanks to the ground truth, generated by a complex image formation equation, we could successfully perform a qualitative and quantitative evaluation of proposed technique, compared to other purely data driven approaches

Via

Access Paper or Ask Questions

Meet JEANIE: a Similarity Measure for 3D Skeleton Sequences via Temporal-Viewpoint Alignment

Feb 07, 2024
Lei Wang, Jun Liu, Liang Zheng, Tom Gedeon, Piotr Koniusz

Video sequences exhibit significant nuisance variations (undesired effects) of speed of actions, temporal locations, and subjects' poses, leading to temporal-viewpoint misalignment when comparing two sets of frames or evaluating the similarity of two sequences. Thus, we propose Joint tEmporal and cAmera viewpoiNt alIgnmEnt (JEANIE) for sequence pairs. In particular, we focus on 3D skeleton sequences whose camera and subjects' poses can be easily manipulated in 3D. We evaluate JEANIE on skeletal Few-shot Action Recognition (FSAR), where matching well temporal blocks (temporal chunks that make up a sequence) of support-query sequence pairs (by factoring out nuisance variations) is essential due to limited samples of novel classes. Given a query sequence, we create its several views by simulating several camera locations. For a support sequence, we match it with view-simulated query sequences, as in the popular Dynamic Time Warping (DTW). Specifically, each support temporal block can be matched to the query temporal block with the same or adjacent (next) temporal index, and adjacent camera views to achieve joint local temporal-viewpoint warping. JEANIE selects the smallest distance among matching paths with different temporal-viewpoint warping patterns, an advantage over DTW which only performs temporal alignment. We also propose an unsupervised FSAR akin to clustering of sequences with JEANIE as a distance measure. JEANIE achieves state-of-the-art results on NTU-60, NTU-120, Kinetics-skeleton and UWA3D Multiview Activity II on supervised and unsupervised FSAR, and their meta-learning inspired fusion.

* Under minor revision with IJCV. An extension of our ACCV'22 paper [arXiv:arXiv:2210.16820] which was distinguished by the Sang Uk Lee Best Student Paper Award. arXiv admin note: text overlap with arXiv:2112.12668

Via

Access Paper or Ask Questions

BirdNeRF: Fast Neural Reconstruction of Large-Scale Scenes From Aerial Imagery

Feb 07, 2024
Huiqing Zhang, Yifei Xue, Ming Liao, Yizhen Lao

In this study, we introduce BirdNeRF, an adaptation of Neural Radiance Fields (NeRF) designed specifically for reconstructing large-scale scenes using aerial imagery. Unlike previous research focused on small-scale and object-centric NeRF reconstruction, our approach addresses multiple challenges, including (1) Addressing the issue of slow training and rendering associated with large models. (2) Meeting the computational demands necessitated by modeling a substantial number of images, requiring extensive resources such as high-performance GPUs. (3) Overcoming significant artifacts and low visual fidelity commonly observed in large-scale reconstruction tasks due to limited model capacity. Specifically, we present a novel bird-view pose-based spatial decomposition algorithm that decomposes a large aerial image set into multiple small sets with appropriately sized overlaps, allowing us to train individual NeRFs of sub-scene. This decomposition approach not only decouples rendering time from the scene size but also enables rendering to scale seamlessly to arbitrarily large environments. Moreover, it allows for per-block updates of the environment, enhancing the flexibility and adaptability of the reconstruction process. Additionally, we propose a projection-guided novel view re-rendering strategy, which aids in effectively utilizing the independently trained sub-scenes to generate superior rendering results. We evaluate our approach on existing datasets as well as against our own drone footage, improving reconstruction speed by 10x over classical photogrammetry software and 50x over state-of-the-art large-scale NeRF solution, on a single GPU with similar rendering quality.

Via

Access Paper or Ask Questions

NITO: Neural Implicit Fields for Resolution-free Topology Optimization

Feb 07, 2024
Amin Heyrani Nobari, Giorgio Giannone, Lyle Regenwetter, Faez Ahmed

Topology optimization is a critical task in engineering design, where the goal is to optimally distribute material in a given space for maximum performance. We introduce Neural Implicit Topology Optimization (NITO), a novel approach to accelerate topology optimization problems using deep learning. NITO stands out as one of the first frameworks to offer a resolution-free and domain-agnostic solution in deep learning-based topology optimization. NITO synthesizes structures with up to seven times better structural efficiency compared to SOTA diffusion models and does so in a tenth of the time. In the NITO framework, we introduce a novel method, the Boundary Point Order-Invariant MLP (BPOM), to represent boundary conditions in a sparse and domain-agnostic manner, moving away from expensive simulation-based approaches. Crucially, NITO circumvents the domain and resolution limitations that restrict Convolutional Neural Network (CNN) models to a structured domain of fixed size -- limitations that hinder the widespread adoption of CNNs in engineering applications. This generalizability allows a single NITO model to train and generate solutions in countless domains, eliminating the need for numerous domain-specific CNNs and their extensive datasets. Despite its generalizability, NITO outperforms SOTA models even in specialized tasks, is an order of magnitude smaller, and is practically trainable at high resolutions that would be restrictive for CNNs. This combination of versatility, efficiency, and performance underlines NITO's potential to transform the landscape of engineering design optimization problems through implicit fields.

Via

Access Paper or Ask Questions

Admittance Controller Complemented with Real-time Singularity Avoidance for Rehabilitation Parallel Robots

Jan 17, 2024
Jose L. Pulloquinga, Rafael J. Escarabajal, Marina Valles, Miguel Diaz-Rodriguez, Vicente Mata, Angel Valera

Rehabilitation tasks demand robust and accurate trajectory-tracking performance, mainly achieved with parallel robots. In this field, limiting the value of the force exerted on the patient is crucial, especially when an injured limb is involved. In human-robot interaction studies, the admittance controller modifies the location of the robot according to the user efforts driving the end-effector to an arbitrary location within the workspace. However, a parallel robot has singularities within the workspace, making implementing a conventional admittance controller unsafe. Thus, this study proposes an admittance controller that overcomes the limitations of singular configurations by using a real-time singularity avoidance algorithm. The singularity avoidance algorithm modifies the original trajectory based on the actual location of the parallel robot. The complemented admittance controller is applied to a 4 degrees of freedom parallel robot for knee rehabilitation. In this case, the actual location is measured by a 3D tracking system because the location calculated by the forward kinematics is inaccurate in the vicinity of a singularity. The experimental results verify the effectiveness of the proposed admittance controller for safe knee rehabilitation exercises

Via

Access Paper or Ask Questions