Unsupervised Out-of-Distribution (OOD) detection consists in identifying anomalous regions in images leveraging only models trained on images of healthy anatomy. An established approach is to tokenize images and model the distribution of tokens with Auto-Regressive (AR) models. AR models are used to 1) identify anomalous tokens and 2) in-paint anomalous representations with in-distribution tokens. However, AR models are slow at inference time and prone to error accumulation issues which negatively affect OOD detection performance. Our novel method, MIM-OOD, overcomes both speed and error accumulation issues by replacing the AR model with two task-specific networks: 1) a transformer optimized to identify anomalous tokens and 2) a transformer optimized to in-paint anomalous tokens using masked image modelling (MIM). Our experiments with brain MRI anomalies show that MIM-OOD substantially outperforms AR models (DICE 0.458 vs 0.301) while achieving a nearly 25x speedup (9.5s vs 244s).
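For readers unfamiliar with the DICE figures quoted in the abstract above, the following is a minimal sketch of how a Dice overlap between a predicted anomaly mask and a reference mask can be computed. The function name `dice_score`, the flat-list mask representation, and the toy masks are illustrative assumptions and are not taken from the MIM-OOD implementation.

```python
def dice_score(pred, target):
    """Dice overlap 2*|A and B| / (|A| + |B|) for two flat binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Convention chosen here (an assumption): two empty masks agree perfectly.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy masks: 1 marks a pixel flagged as anomalous.
pred = [0, 1, 1, 0, 1, 0]
target = [0, 1, 0, 0, 1, 1]
print(dice_score(pred, target))

# Ratio of the two inference times reported in the abstract.
print(244 / 9.5)
```

A score of 1 means the predicted and reference masks coincide, and 0 means they do not overlap at all.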
Momentum is known to accelerate the convergence of gradient descent in strongly convex settings without stochastic gradient noise. In stochastic optimization, such as training neural networks, folklore suggests that momentum may help deep learning optimization by reducing the variance of the stochastic gradient update, but previous theoretical analyses do not find momentum to offer any provable acceleration. Theoretical results in this paper clarify the role of momentum in stochastic settings where the learning rate is small and gradient noise is the dominant source of instability, suggesting that SGD with and without momentum behave similarly in the short and long time horizons. Experiments show that momentum indeed has limited benefits for both optimization and generalization in practical training regimes where the optimal learning rate is not very large, including small- to medium-batch training from scratch on ImageNet and fine-tuning language models on downstream tasks.
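The comparison described in the abstract above can be illustrated with a toy experiment. This is a minimal sketch on a one-dimensional noisy quadratic, not the paper's setup. It assumes the heavy-ball form of momentum and the common convention of comparing SGD with momentum at learning rate `lr` against plain SGD at the rescaled rate `lr / (1 - beta)`; the abstract itself does not state this scaling, and the function `run` and all constants are hypothetical.

```python
import random


def run(steps, lr, beta, noise_std, seed):
    """Noisy gradient descent on f(x) = 0.5 * x**2 with heavy-ball momentum.

    Setting beta = 0 recovers plain SGD.
    """
    rng = random.Random(seed)
    x, m = 1.0, 0.0
    for _ in range(steps):
        grad = x + rng.gauss(0.0, noise_std)  # true gradient plus Gaussian noise
        m = beta * m + grad                   # momentum buffer
        x -= lr * m
    return x


lr, beta = 1e-3, 0.9
x_sgdm = run(5000, lr, beta, noise_std=1.0, seed=0)
# Plain SGD at the rescaled learning rate (assumed convention, see above).
x_sgd = run(5000, lr / (1.0 - beta), 0.0, noise_std=1.0, seed=0)
print(x_sgdm, x_sgd)
```

In this small-learning-rate regime both runs end up fluctuating close to the minimum at zero, which is the kind of similarity the abstract refers to.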
Primary care professionals struggle to keep up to date with the latest scientific literature critical in guiding evidence-based practice related to their daily work. To address this problem, we employed generative artificial intelligence techniques based on large-scale language models to summarize abstracts of scientific papers. Our objective is to investigate whether generative artificial intelligence can reduce the cognitive load, that is, the mental effort and burden, experienced by practitioners. The study participants were provided with two use cases related to preventive care and behavior change, simulating a search for new scientific literature. The study included 113 university students from Slovenia and the United States randomized into three distinct study groups. The first group was assigned the full abstracts. The second group was assigned the short abstracts generated by AI. The third group had the option to select a full abstract in addition to the AI-generated short summary. Each use case study included ten retrieved abstracts. Our research demonstrates that the use of generative AI for literature review is efficient and effective. The time needed to answer questions related to the content of abstracts was significantly lower in groups two and three compared to the first group using full abstracts. The results, however, also show significantly lower accuracy in extracted knowledge in cases where the full abstract was not available. Such a disruptive technology could significantly reduce the time required for healthcare professionals to keep up with the most recent scientific literature; nevertheless, further developments are needed to help them comprehend the knowledge accurately.
Computing in the network (COIN) is a promising technology that allows processing to be carried out within network devices such as switches and network interface cards. Time-sensitive applications can achieve their quality of service (QoS) targets by flexibly distributing caching and computing tasks across the cloud-edge-mist continuum. This paper highlights the advantages of in-network computing, compared to edge computing, in terms of latency and traffic filtering. We consider a critical use case: a COVID-19 alert application in an airport setting. Arriving travelers are monitored through cough analysis so that potentially infected cases can be detected and isolated for medical tests. We compare the performance of an architecture using in-network computing with that of one using edge computing. We show using simulations that in-network computing outperforms edge computing in terms of Round Trip Time (RTT) and traffic filtering.
We have developed a methodology for the systematic generation of a large image dataset of macerated wood references, which we used to generate image data for nine hardwood genera. This is the basis for a substantial approach to automate, for the first time, the identification of hardwood species in microscopic images of fibrous materials by deep learning. Our methodology includes a flexible pipeline for easy annotation of vessel elements. We compare the performance of different neural network architectures and hyperparameters. Our proposed method performs similarly well to human experts. In the future, this will improve controls on global wood fiber product flows to protect forests.
Cortical surface reconstruction plays a fundamental role in modeling the rapid brain development during the perinatal period. In this work, we propose Conditional Temporal Attention Network (CoTAN), a fast end-to-end framework for diffeomorphic neonatal cortical surface reconstruction. CoTAN predicts multi-resolution stationary velocity fields (SVF) from neonatal brain magnetic resonance images (MRI). Instead of integrating multiple SVFs, CoTAN introduces attention mechanisms to learn a conditional time-varying velocity field (CTVF) by computing the weighted sum of all SVFs at each integration step. The importance of each SVF, which is estimated by learned attention maps, is conditioned on the age of the neonates and varies with the time step of integration. The proposed CTVF defines a diffeomorphic surface deformation, which reduces mesh self-intersection errors effectively. It only requires 0.21 seconds to deform an initial template mesh to cortical white matter and pial surfaces for each brain hemisphere. CoTAN is validated on the Developing Human Connectome Project (dHCP) dataset with 877 3D brain MR images acquired from preterm and term born neonates. Compared to state-of-the-art baselines, CoTAN achieves superior performance with only 0.12mm geometric error and 0.07% self-intersecting faces. The visualization of our attention maps illustrates that CoTAN indeed learns coarse-to-fine surface deformations automatically without intermediate supervision.
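The weighted-sum construction described in the abstract above can be sketched in a few lines. This is a toy illustration only: it assumes forward-Euler integration, replaces the learned, age-conditioned attention maps with hand-set scalar logits per step, and uses two constant 2D fields in place of predicted multi-resolution SVFs. The names `softmax`, `integrate_ctvf`, `coarse`, and `fine` are hypothetical.

```python
import math


def softmax(logits):
    """Normalize a list of logits into weights that sum to one."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def integrate_ctvf(point, svfs, logits_per_step, step_size):
    """Forward-Euler integration of v_t(x) = sum_r w_r(t) * svf_r(x)."""
    x, y = point
    for logits in logits_per_step:
        weights = softmax(logits)  # importance of each SVF at this step
        vx = sum(w * f(x, y)[0] for w, f in zip(weights, svfs))
        vy = sum(w * f(x, y)[1] for w, f in zip(weights, svfs))
        x, y = x + step_size * vx, y + step_size * vy
    return x, y


def coarse(x, y):
    return (1.0, 0.0)  # toy stationary field: constant shift along x


def fine(x, y):
    return (0.0, 1.0)  # toy stationary field: constant shift along y


# Hypothetical logits: the coarse field dominates early, the fine field later.
logits = [[4.0, 0.0], [2.0, 2.0], [0.0, 4.0]]
print(integrate_ctvf((0.0, 0.0), [coarse, fine], logits, step_size=0.5))
```

Because the weights change from step to step, the resulting velocity field is time-varying even though each underlying field is stationary.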
For approximate nearest neighbor search, graph-based algorithms have been shown to offer the best trade-off between accuracy and search time. We propose the Dynamic Exploration Graph (DEG), which significantly outperforms existing algorithms in terms of search and exploration efficiency by combining two new ideas. First, a single undirected even regular graph is incrementally built by partially replacing existing edges to integrate new vertices and to update old neighborhoods at the same time. Second, an edge optimization algorithm is used to continuously improve the quality of the graph. Combining this ongoing refinement with the graph construction process leads to a well-organized graph structure at all times, resulting in: (1) increased search efficiency, (2) predictable index size, (3) guaranteed connectivity and therefore reachability of all vertices, and (4) a dynamic graph structure. In addition, we investigate how well existing graph-based search systems can handle indexed queries where the seed vertex of a search is the query itself. Such exploration tasks, despite their good starting point, are not necessarily easy. High efficiency in approximate nearest neighbor search (ANNS) does not automatically imply good performance in exploratory search. Extensive experiments show that our new Dynamic Exploration Graph outperforms existing algorithms significantly for indexed and unindexed queries.
Channel estimation (CE) plays a key role in reconfigurable intelligent surface (RIS)-aided multiple-input multiple-output (MIMO) communication systems, but it is a challenging task due to the passive nature of the RIS and the cascaded channel structure. In this paper, a partially decoupled atomic norm minimization (PDANM) framework is proposed for CE of RIS-aided MIMO systems, which exploits the three-dimensional angular sparsity of the channel. In particular, PDANM partially decouples the differential angles at the RIS from other angles at the base station and user equipment, reducing the computational complexity compared with existing methods. A reweighted PDANM (RPDANM) algorithm is proposed to further improve CE accuracy, which iteratively refines CE through a specifically designed reweighting strategy. Building upon RPDANM, we propose an iterative approach named RPDANM with adaptive phase control (RPDANM-APC), which adaptively adjusts the RIS phases based on previously estimated channel parameters to facilitate CE, achieving superior CE accuracy while reducing training overhead. Numerical simulations demonstrate the superiority of our proposed approaches in terms of running time, CE accuracy, and training overhead. In particular, the RPDANM-APC approach can achieve higher CE accuracy than existing methods with less than 40 percent of the training overhead while reducing the running time by tens of times.
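To make the search setting in the abstract above concrete, here is a minimal sketch of a generic best-first search over a neighborhood graph, the kind of routine that graph-based ANNS indexes typically rely on. It is not the DEG construction or edge optimization algorithm. The function `greedy_search`, the `beam` parameter, and the six-vertex ring graph (a toy 2-regular undirected graph) are assumptions made for illustration.

```python
import heapq


def greedy_search(graph, points, query, seed, k=1, beam=4):
    """Best-first search on a neighborhood graph, starting at vertex `seed`."""

    def dist(i):
        # Squared Euclidean distance between vertex i and the query.
        return sum((a - b) ** 2 for a, b in zip(points[i], query))

    visited = {seed}
    candidates = [(dist(seed), seed)]  # min-heap of vertices still to expand
    results = [(-dist(seed), seed)]    # max-heap (negated) of best vertices so far
    while candidates:
        d, v = heapq.heappop(candidates)
        # Stop once the closest open candidate is worse than the worst result.
        if len(results) >= beam and d > -results[0][0]:
            break
        for n in graph[v]:
            if n in visited:
                continue
            visited.add(n)
            dn = dist(n)
            if len(results) < beam or dn < -results[0][0]:
                heapq.heappush(candidates, (dn, n))
                heapq.heappush(results, (-dn, n))
                if len(results) > beam:
                    heapq.heappop(results)  # drop the current worst result
    ranked = sorted((-neg_d, v) for neg_d, v in results)
    return [v for _, v in ranked[:k]]


# Toy index: six points on a line, connected as a ring (every vertex has degree 2).
points = [(float(i), 0.0) for i in range(6)]
graph = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}

print(greedy_search(graph, points, (3.2, 0.0), seed=0, k=1, beam=2))  # unindexed query
print(greedy_search(graph, points, points[3], seed=3, k=1, beam=2))   # indexed query
```

The second call mirrors the exploration scenario from the abstract, where the seed vertex of the search is the query itself.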
The batch size is an essential parameter to tune during the development of new neural networks. Amongst other quality indicators, it has a large degree of influence on the model's accuracy, generalisability, training times and parallelisability. This fact is generally known and commonly studied. However, during the application phase of a deep learning model, when the model is utilised by an end-user for inference, we find that there is a disregard for the potential benefits of introducing a batch size. In this study, we examine the effect of input batching on the energy consumption and response times of five fully-trained neural networks for computer vision that were considered state-of-the-art at the time of their publication. The results suggest that batching has a significant effect on both of these metrics. Furthermore, we present a timeline of the energy efficiency and accuracy of neural networks over the past decade. We find that in general, energy consumption rises at a much steeper pace than accuracy and question the necessity of this evolution. Additionally, we highlight one particular network, ShuffleNetV2 (2018), that achieved a competitive performance for its time while maintaining a much lower energy consumption. Nevertheless, we highlight that the results are model dependent.
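The idea of input batching at inference time, as studied in the abstract above, can be sketched as follows. This is a hedged illustration rather than the study's measurement harness: `CountingModel` is a hypothetical stand-in for a trained network that only counts forward passes, and no energy or latency is measured here.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


class CountingModel:
    """Hypothetical stand-in for a trained network; counts forward passes."""

    def __init__(self):
        self.calls = 0

    def __call__(self, batch):
        self.calls += 1
        return [2 * x for x in batch]  # placeholder computation


def run_inference(model, inputs, batch_size):
    """Run the model over all inputs, one batch at a time."""
    outputs = []
    for batch in batched(inputs, batch_size):
        outputs.extend(model(batch))
    return outputs


inputs = list(range(10))
single, grouped = CountingModel(), CountingModel()
run_inference(single, inputs, batch_size=1)
run_inference(grouped, inputs, batch_size=4)
print(single.calls, grouped.calls)
```

Batching leaves the outputs unchanged but reduces the number of forward passes, which is the quantity whose effect on energy consumption and response time the study examines.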
Video question-answering is a fundamental task in the field of video understanding. Although current vision-language models (VLMs) equipped with Video Transformers have enabled temporal modeling and yielded superior results, they come at the cost of huge computational power and are thus too expensive to deploy in real-time application scenarios. An economical workaround samples only a small portion of frames to represent the main content of the video and tunes an image-text model on these sampled frames. Recent video understanding models usually sample a set of frames or clips at random, regardless of the internal correlations between their visual contents or their relevance to the problem. We argue that such aimless sampling may omit the key frames from which the correct answer can be deduced, and that the situation gets worse as the sampling sparsity increases, which always happens as videos get longer. To mitigate this issue, we propose two frame sampling strategies, namely the most domain frames (MDF) and most implied frames (MIF), to maximally preserve those frames that are most likely vital to the given questions. MDF passively minimizes the risk of key frame omission in a bootstrap manner, while MIF actively searches for key frames customized for each video-question pair with the assistance of auxiliary models. The experimental results on three public datasets from three advanced VLMs (CLIP, GIT and All-in-one) demonstrate that our proposed strategies can boost the performance of image-text pretrained models. The source code for the method proposed in this paper is publicly available at https://github.com/declare-lab/sas-vqa.