Data entry systems remain structurally vulnerable to categorical misclassifications, particularly in small and medium sized enterprises (SMEs). When nominal categories exhibit semantic or morphological proximity, human machine interaction may produce errors that are irrecoverable ex post. In the absence of automated input controls, manual data entry frequently generates irrecoverable categorical distortions that propagate into Key Performance Indicators (KPIs), thereby misleading managerial decision making. State of the art normalization tools typically evaluate semantic and morphological dimensions in isolation and rely heavily on standard dictionaries, rendering them ineffective for SME master data rich in custom SKUs, abbreviations, and domain-specific technical jargon. This paper introduces the Categorical Error Sensitivity Index (ISEC), an ordinal composite score designed to rank category pairs according to their structural susceptibility to confusion. ISEC integrates semantic distance (via word embeddings), custom weighted morphological transformation costs (through an adapted Damerau Levenshtein algorithm), and empirical frequency into a unified, mathematically robust preventive framework. By leveraging vector database architectures, ISEC reduces computational complexity, achieving approximately a 195x performance improvement over brute-force methods. Validated across three heterogeneous datasets: governmental judicial records, retail inventory, and a synthetic ISO coded metalworking catalog, ISEC provides a scalable and proactive data governance instrument that enables SMEs to detect latent structural risk embedded within their categorical data assets.
Behaviour-Driven Development (BDD) suites accumulate step-text duplication whose maintenance cost is established in prior work. Existing detection techniques require running the tests (Binamungu et al., 2018-2023) or are confined to a single organisation (Irshad et al., 2020-2022), leaving a gap: a purely static, paraphrase-robust, step-level detector usable on any repository. We fill the gap with cukereuse, an open-source Python CLI combining exact hashing, Levenshtein ratio, and sentence-transformer embeddings in a layered pipeline, released alongside an empirical corpus of 347 public GitHub repositories, 23,667 parsed .feature files, and 1,113,616 Gherkin steps. The step-weighted exact-duplicate rate is 80.2 %; the median-repository rate is 58.6 % (Spearman rho = 0.51 with size). The top hybrid cluster groups 20.7k occurrences across 2.2k files. Against 1,020 pairs manually labelled by the three authors under a released rubric (inter-annotator Fleiss' kappa = 0.84 on a 60-pair overlap), we report precision, recall, and F1 with bootstrap 95 % CIs under two protocols: the primary rubric and a score-free second-pass relabelling. The strongest honest pair-level number is near-exact at F1 = 0.822 on score-free labels; the primary-rubric semantic F1 = 0.906 is inflated by a stratification artefact that pins recall at 1.000. Lexical baselines (SourcererCC-style, NiCad-style) reach primary F1 = 0.761 and 0.799. The paper also presents a CDN-structured critique of Gherkin (Cognitive Dimensions of Notations); eight of fourteen dimensions are rated problematic or unsupported. The tool, corpus, labelled pairs, rubric, and pipeline are released under permissive licences.
The rapid developments in artificial intelligence (AI) research in radiology have produced numerous models that are scattered across various platforms and sources, limiting discoverability, reproducibility and clinical translation. Herein, OpenRad was created, a curated, standardized, open-access repository that aggregates radiology AI models and providing details such as the availability of pretrained weights and interactive applications. Retrospective analysis of peer reviewed literature and preprints indexed in PubMed, arXiv and Scopus was performed until Dec 2025 (n = 5239 records). Model records were generated using a locally hosted LLM (gpt-oss:120b), based on the RSNA AI Roadmap JSON schema, and manually verified by ten expert reviewers. Stability of LLM outputs was assessed on 225 randomly selected papers using text similarity metrics. A total of 1694 articles were included after review. Included models span all imaging modalities (CT, MRI, X-ray, US) and radiology subspecialties. Automated extraction demonstrated high stability for structured fields (Levenshtein ratio > 90%), with 78.5% of record edits being characterized as minor during expert review. Statistical analysis of the repository revealed CNN and transformer architectures as dominant, while MRI was the most commonly used modality (in 621 neuroradiology AI models). Research output was mostly concentrated in China and the United States. The OpenRad web interface enables model discovery via keyword search and filters for modality, subspecialty, intended use, verification status and demo availability, alongside live statistics. The community can contribute new models through a dedicated portal. OpenRad contains approx. 1700 open access, curated radiology AI models with standardized metadata, supplemented with analysis of code repositories, thereby creating a comprehensive, searchable resource for the radiology community.
Log parsing is a critical standard operating procedure in software systems, enabling monitoring, anomaly detection, and failure diagnosis. However, automated log parsing remains challenging due to heterogeneous log formats, distribution shifts between training and deployment data, and the brittleness of rule-based approaches. This study aims to systematically evaluate how sequence modelling architecture, representation choice, sequence length, and training data availability influence automated log parsing performance and computational cost. We conduct a controlled empirical study comparing four sequence modelling architectures: Transformer, Mamba state-space, monodirectional LSTM, and bidirectional LSTM models. In total, 396 models are trained across multiple dataset configurations and evaluated using relative Levenshtein edit distance with statistical significance testing. Transformer achieves the lowest mean relative edit distance (0.111), followed by Mamba (0.145), mono-LSTM (0.186), and bi-LSTM (0.265), where lower values are better. Mamba provides competitive accuracy with substantially lower computational cost. Character-level tokenization generally improves performance, sequence length has negligible practical impact on Transformer accuracy, and both Mamba and Transformer demonstrate stronger sample efficiency than recurrent models. Overall, Transformers reduce parsing error by 23.4%, while Mamba is a strong alternative under data or compute constraints. These results also clarify the roles of representation choice, sequence length, and sample efficiency, providing practical guidance for researchers and practitioners.
This work presents an efficient algorithmic framework for real-time identification, classification, and evaluation of human physiotherapy exercises using mobile devices. The proposed method interprets a kinetic movement as a sequence of static poses, which are estimated from camera input using a pose-estimation neural network. Extracted body keypoints are transformed into trigonometric angle-based features and classified with lightweight supervised models to generate frame-level pose predictions and accuracy scores. To recognize full exercise movements and detect deviations from prescribed patterns, we employ a dynamic-programming scheme based on a modified Levenshtein distance algorithm, enabling robust sequence matching and localization of inaccuracies. The system operates entirely on the client side, ensuring scalability and real-time performance. Experimental evaluation demonstrates the effectiveness of the methodology and highlights its applicability to remote physiotherapy supervision and m-health applications.




Turning static slides into engaging video lectures takes considerable time and effort, requiring presenters to record explanations and visually guide their audience through the material. We introduce an end-to-end system designed to automate this process entirely. Given a slide deck, this system synthesizes a video lecture featuring AI-generated narration synchronized precisely with dynamic visual highlights. These highlights automatically draw attention to the specific concept being discussed, much like an effective presenter would. The core technical contribution is a novel highlight alignment module. This module accurately maps spoken phrases to locations on a given slide using diverse strategies (e.g., Levenshtein distance, LLM-based semantic analysis) at selectable granularities (line or word level) and utilizes timestamp-providing Text-to-Speech (TTS) for timing synchronization. We demonstrate the system's effectiveness through a technical evaluation using a manually annotated slide dataset with 1000 samples, finding that LLM-based alignment achieves high location accuracy (F1 > 92%), significantly outperforming simpler methods, especially on complex, math-heavy content. Furthermore, the calculated generation cost averages under $1 per hour of video, offering potential savings of two orders of magnitude compared to conservative estimates of manual production costs. This combination of high accuracy and extremely low cost positions this approach as a practical and scalable tool for transforming static slides into effective, visually-guided video lectures.




Numbers are a basic part of how humans represent and describe the world around them. As a consequence, learning effective representations of numbers is critical for the success of large language models as they become more integrated into everyday decisions. However, these models face a challenge: depending on context, the same sequence of digit tokens, e.g., 911, can be treated as a number or as a string. What kind of representations arise from this duality, and what are its downstream implications? Using a similarity-based prompting technique from cognitive science, we show that LLMs learn representational spaces that blend string-like and numerical representations. In particular, we show that elicited similarity judgments from these models over integer pairs can be captured by a combination of Levenshtein edit distance and numerical Log-Linear distance, suggesting an entangled representation. In a series of experiments we show how this entanglement is reflected in the latent embeddings, how it can be reduced but not entirely eliminated by context, and how it can propagate into a realistic decision scenario. These results shed light on a representational tension in transformer models that must learn what a number is from text input.




Retrieval-augmented machine translation leverages examples from a translation memory by retrieving similar instances. These examples are used to condition the predictions of a neural decoder. We aim to improve the upstream retrieval step and consider a fixed downstream edit-based model: the multi-Levenshtein Transformer. The task consists of finding a set of examples that maximizes the overall coverage of the source sentence. To this end, we rely on the theory of submodular functions and explore new algorithms to optimize this coverage. We evaluate the resulting performance gains for the machine translation task.




Bangla is the seventh most spoken language by a total number of speakers in the world, and yet the development of an automated grammar checker in this language is an understudied problem. Bangla grammatical error detection is a task of detecting sub-strings of a Bangla text that contain grammatical, punctuation, or spelling errors, which is crucial for developing an automated Bangla typing assistant. Our approach involves breaking down the task as a token classification problem and utilizing state-of-the-art transformer-based models. Finally, we combine the output of these models and apply rule-based post-processing to generate a more reliable and comprehensive result. Our system is evaluated on a dataset consisting of over 25,000 texts from various sources. Our best model achieves a Levenshtein distance score of 1.04. Finally, we provide a detailed analysis of different components of our system.




Levenshtein transformer (LevT) is a non-autoregressive machine translation model with high decoding efficiency and comparable translation quality in terms of bleu score, due to its parallel decoding and iterative refinement procedure. Are there any deficiencies of its translations and what improvements could be made? In this report, we focus on LevT's decoder and analyse the decoding results length, subword generation, and deletion module's capability. We hope to identify weaknesses of the decoder for future improvements. We also compare translations of the original LevT, knowledge-distilled LevT, LevT with translation memory, and the KD-LevT with translation memory to see how KD and translation memory can help.