Abstract:Archaeologists, as well as specialists and practitioners in cultural heritage, require applications with additional functions, such as the annotation and attachment of metadata to specific regions of the 3D digital artifacts, to go beyond the simplistic three-dimensional (3D) visualization. Different strategies addressed this issue, most of which are excellent in their particular area of application, but their capacity is limited to their design's purpose; they lack generalization and interoperability. This paper introduces ART3mis, a general-purpose, user-friendly, feature-rich, interactive web-based textual annotation tool for 3D objects. Moreover, it enables the communication, distribution, and reuse of information as it complies with the W3C Web Annotation Data Model. It is primarily designed to help cultural heritage conservators, restorers, and curators who lack technical expertise in 3D imaging and graphics, handle, segment, and annotate 3D digital replicas of artifacts with ease.
Abstract:Beyond simplistic 3D visualisations, archaeologists, as well as cultural heritage experts and practitioners, need applications with advanced functionalities. Such as the annotation and attachment of metadata onto particular regions of the 3D digital objects. Various approaches have been presented to tackle this challenge, most of which achieve excellent results in the domain of their application. However, they are often confined to that specific domain and particular problem. In this paper, we present ART3mis - a general-purpose, user-friendly, interactive textual annotation tool for 3D objects. Primarily attuned to aid cultural heritage conservators, restorers and curators with no technical skills in 3D imaging and graphics, the tool allows for the easy handling, segmenting and annotating of 3D digital replicas of artefacts. ART3mis applies a user-driven, direct-on-surface approach. It can handle detailed 3D cultural objects in real-time and store textual annotations for multiple complex regions in JSON data format.
Abstract:Vision Transformers (ViTs) have achieved state-of-the-art performance in image classification, yet their attention mechanisms often remain opaque and exhibit dense, non-structured behaviors. In this work, we adapt our previously proposed SVD-Inspired Attention (SVDA) mechanism to the ViT architecture, introducing a geometrically grounded formulation that enhances interpretability, sparsity, and spectral structure. We apply the use of interpretability indicators -- originally proposed with SVDA -- to monitor attention dynamics during training and assess structural properties of the learned representations. Experimental evaluations on four widely used benchmarks -- CIFAR-10, FashionMNIST, CIFAR-100, and ImageNet-100 -- demonstrate that SVDA consistently yields more interpretable attention patterns without sacrificing classification accuracy. While the current framework offers descriptive insights rather than prescriptive guidance, our results establish SVDA as a comprehensive and informative tool for analyzing and developing structured attention models in computer vision. This work lays the foundation for future advances in explainable AI, spectral diagnostics, and attention-based model compression.
Abstract:Monocular depth estimation is a central problem in computer vision with applications in robotics, AR, and autonomous driving, yet the self-attention mechanisms that drive modern Transformer architectures remain opaque. We introduce SVD-Inspired Attention (SVDA) into the Dense Prediction Transformer (DPT), providing the first spectrally structured formulation of attention for dense prediction tasks. SVDA decouples directional alignment from spectral modulation by embedding a learnable diagonal matrix into normalized query-key interactions, enabling attention maps that are intrinsically interpretable rather than post-hoc approximations. Experiments on KITTI and NYU-v2 show that SVDA preserves or slightly improves predictive accuracy while adding only minor computational overhead. More importantly, SVDA unlocks six spectral indicators that quantify entropy, rank, sparsity, alignment, selectivity, and robustness. These reveal consistent cross-dataset and depth-wise patterns in how attention organizes during training, insights that remain inaccessible in standard Transformers. By shifting the role of attention from opaque mechanism to quantifiable descriptor, SVDA redefines interpretability in monocular depth estimation and opens a principled avenue toward transparent dense prediction models.
Abstract:Modern deep-learning training is not memoryless. Updates depend on optimizer moments and averaging, data-order policies (random reshuffling vs with-replacement, staged augmentations and replay), the nonconvex path, and auxiliary state (teacher EMA/SWA, contrastive queues, BatchNorm statistics). This survey organizes mechanisms by source, lifetime, and visibility. It introduces seed-paired, function-space causal estimands; portable perturbation primitives (carry/reset of momentum/Adam/EMA/BN, order-window swaps, queue/teacher tweaks); and a reporting checklist with audit artifacts (order hashes, buffer/BN checksums, RNG contracts). The conclusion is a protocol for portable, causal, uncertainty-aware measurement that attributes how much training history matters across models, data, and regimes.
Abstract:Deep networks learn internal representations whose geometry--how features bend, rotate, and evolve--affects both generalization and robustness. Existing similarity measures such as CKA or SVCCA capture pointwise overlap between activation sets, but miss how representations change along input paths. Two models may appear nearly identical under these metrics yet respond very differently to perturbations or adversarial stress. We introduce representation holonomy, a gauge-invariant statistic that measures this path dependence. Conceptually, holonomy quantifies the "twist" accumulated when features are parallel-transported around a small loop in input space: flat representations yield zero holonomy, while nonzero values reveal hidden curvature. Our estimator fixes gauge through global whitening, aligns neighborhoods using shared subspaces and rotation-only Procrustes, and embeds the result back to the full feature space. We prove invariance to orthogonal (and affine, post-whitening) transformations, establish a linear null for affine layers, and show that holonomy vanishes at small radii. Empirically, holonomy increases with loop radius, separates models that appear similar under CKA, and correlates with adversarial and corruption robustness. It also tracks training dynamics as features form and stabilize. Together, these results position representation holonomy as a practical and scalable diagnostic for probing the geometric structure of learned representations beyond pointwise similarity.
Abstract:This work proposes neural training as a \emph{process tensor}: a multi-time map that takes a sequence of controllable instruments (batch choices, augmentations, optimizer micro-steps) and returns an observable of the trained model. Building on this operational lens, we introduce a simple, model-agnostic witness of training memory based on \emph{back-flow of distinguishability}. In a controlled two-step protocol, we compare outcome distributions after one intervention versus two; the increase $Δ_{\mathrm{BF}} = D_2 - D_1>0$ (with $D\in\{\mathrm{TV}, \mathrm{JS}, \mathrm{H}\}$ measured on softmax predictions over a fixed probe set) certifies non-Markovianity. We observe consistent positive back-flow with tight bootstrap confidence intervals, amplification under higher momentum, larger batch overlap, and more micro-steps, and collapse under a \emph{causal break} (resetting optimizer state), directly attributing the effect to optimizer/data-state memory. The witness is robust across TV/JS/Hellinger, inexpensive to compute, and requires no architectural changes. We position this as a \emph{measurement} contribution: a principled diagnostic and empirical evidence that practical SGD deviates from the Markov idealization. An exploratory case study illustrates how the micro-level signal can inform curriculum orderings. "Data order matters" turns into a testable operator with confidence bounds, our framework offers a common stage to compare optimizers, curricula, and schedules through their induced training memory.
Abstract:Automatic Speech Recognition (ASR) still face challenges when recognizing time-variant rare-phrases. Contextual biasing (CB) modules bias ASR model towards such contextually-relevant phrases. During training, a list of biasing phrases are selected from a large pool of phrases following a sampling strategy. In this work we firstly analyse different sampling strategies to provide insights into the training of CB for ASR with correlation plots between the bias embeddings among various training stages. Secondly, we introduce a neighbourhood attention (NA) that localizes self attention (SA) to the nearest neighbouring frames to further refine the CB output. The results show that this proposed approach provides on average a 25.84% relative WER improvement on LibriSpeech sets and rare-word evaluation compared to the baseline.


Abstract:The estimation of depth in two-dimensional images has long been a challenging and extensively studied subject in computer vision. Recently, significant progress has been made with the emergence of Deep Learning-based approaches, which have proven highly successful. This paper focuses on the explainability in monocular depth estimation methods, in terms of how humans perceive depth. This preliminary study emphasizes on one of the most significant visual cues, the relative size, which is prominent in almost all viewed images. We designed a specific experiment to mimic the experiments in humans and have tested state-of-the-art methods to indirectly assess the explainability in the context defined. In addition, we observed that measuring the accuracy required further attention and a particular approach is proposed to this end. The results show that a mean accuracy of around 77% across methods is achieved, with some of the methods performing markedly better, thus, indirectly revealing their corresponding potential to uncover monocular depth cues, like relative size.




Abstract:Black spot identification, a spatiotemporal phenomenon, involves analyzing the geographical location and time-based occurrence of road accidents. Typically, this analysis examines specific locations on road networks during set time periods to pinpoint areas with a higher concentration of accidents, known as black spots. By evaluating these problem areas, researchers can uncover the underlying causes and reasons for increased collision rates, such as road design, traffic volume, driver behavior, weather, and infrastructure. However, challenges in identifying black spots include limited data availability, data quality, and assessing contributing factors. Additionally, evolving road design, infrastructure, and vehicle safety technology can affect black spot analysis and determination. This study focused on traffic accidents in Greek road networks to recognize black spots, utilizing data from police and government-issued car crash reports. The study produced a publicly available dataset called Black Spots of North Greece (BSNG) and a highly accurate identification method.