Object detection is a computer vision task in which the goal is to detect and locate objects of interest in an image or video. The task involves identifying the position and boundaries of objects in an image, and classifying the objects into different categories. It forms a crucial part of vision recognition, alongside image classification and retrieval.
In this paper, we propose a robust real time detection and tracking method for detecting ships in a coastal video sequences. Since coastal scenarios are unpredictable and scenes have dynamic properties it is essential to apply detection methods that are robust to these conditions. This paper presents modified ViBe for moving object detection which detects ships and backwash. In the modified ViBe the probability of losing ships is decreased in comparison with the original ViBe. It is robust to natural sea waves and variation of lights and is capable of quickly updating the background. Based on geometrical properties of ship and some concepts such as brightness distortion, a new method for backwash cancellation is proposed. Experimental results demonstrate that the proposed strategy and methods have outstanding performance in ship detection and tracking. These results also illustrate real time and precise performance of the proposed strategy.
Label assignment is a critical component in training dense object detectors. State-of-the-art methods typically assign each training sample a positive and a negative weight, optimizing the assignment scheme during training. However, these strategies often assign an insufficient number of positive samples to small objects, leading to a scale imbalance during training. To address this limitation, we introduce RFAssigner, a novel assignment strategy designed to enhance the multi-scale learning capabilities of dense detectors. RFAssigner first establishes an initial set of positive samples using a point-based prior. It then leverages a Gaussian Receptive Field (GRF) distance to measure the similarity between the GRFs of unassigned candidate locations and the ground-truth objects. Based on this metric, RFAssigner adaptively selects supplementary positive samples from the unassigned pool, promoting a more balanced learning process across object scales. Comprehensive experiments on three datasets with distinct object scale distributions validate the effectiveness and generalizability of our method. Notably, a single FCOS-ResNet-50 detector equipped with RFAssigner achieves state-of-the-art performance across all object scales, consistently outperforming existing strategies without requiring auxiliary modules or heuristics.
Arrhythmogenic right ventricular cardiomyopathy (ARVC) and long QT syndrome (LQTS) are inherited arrhythmia syndromes associated with sudden cardiac death. Deep learning shows promise for ECG interpretation, but multi-class inherited arrhythmia classification with clinically grounded interpretability remains underdeveloped. Our objective was to develop and validate a lead-aware deep learning framework for multi-class (ARVC vs LQTS vs control) and binary inherited arrhythmia classification, and to determine optimal strategies for integrating ECG foundation models within arrhythmia screening tools. We assembled a 13-center Canadian cohort (645 patients; 1,344 ECGs). We evaluated four ECG foundation models using three transfer learning approaches: linear probing, fine-tuning, and combined strategies. We developed lead-aware spatial attention networks (LASAN) and assessed integration strategies combining LASAN with foundation models. Performance was compared against the established foundation model baselines. Lead-group masking quantified disease-specific lead dependence. Fine-tuning outperformed linear probing and combined strategies across all foundation models (mean macro-AUROC 0.904 vs 0.825). The best lead-aware integrations achieved near-ceiling performance (HuBERT-ECG hybrid: macro-AUROC 0.990; ARVC vs control AUROC 0.999; LQTS vs control AUROC 0.994). Lead masking demonstrated physiologic plausibility: V1-V3 were most critical for ARVC detection (4.54% AUROC reduction), while lateral leads were preferentially important for LQTS (2.60% drop). Lead-aware architectures achieved state-of-the-art performance for inherited arrhythmia classification, outperforming all existing published models on both binary and multi-class tasks while demonstrating clinically aligned lead dependence. These findings support potential utility for automated ECG screening pending validation.
The design of reliable, valid, and diverse molecules is fundamental to modern drug discovery, as improved molecular generation supports efficient exploration of the chemical space for potential drug candidates and reduces the cost of early design efforts. Despite these needs, current chemical language models that generate molecules as SMILES strings are vulnerable to compounding token errors: many samples are unparseable or chemically implausible, and hard constraints meant to prevent failure can restrict exploration. To address this gap, we introduce TSSR, a Two-Stage, Swap-Reward-driven reinforcement learning (RL) framework for character-level SMILES generation. Stage one rewards local token swaps that repair syntax, promoting transitions from invalid to parseable strings. Stage two provides chemistry-aware feedback from RDKit diagnostics, rewarding reductions in valence, aromaticity, and connectivity issues. The reward decomposes into interpretable terms (swap efficiency, error reduction, distance to validity), is model agnostic, and requires no task-specific labels or hand-crafted grammars. We evaluated TSSR on the MOSES benchmark using a GRU policy trained with PPO in both pure RL (P-RL) from random initialization and fine-tuning RL (F-RL) starting from a pretrained chemical language model, assessing 10,000 generated SMILES per run. In P-RL, TSSR significantly improves syntactic validity, chemical validity, and novelty. In F-RL, TSSR preserves drug-likeness and synthesizability while increasing validity and novelty. Token-level analysis shows that syntax edits and chemistry fixes act jointly to reduce RDKit detected errors. TSSR converts a sparse terminal objective into a denser and more interpretable reward, improving both syntactic and chemical quality without reducing diversity. TSSR is dataset-agnostic and can be adapted to various reinforcement learning approaches.
This paper presents a comparative study of a custom convolutional neural network (CNN) architecture against widely used pretrained and transfer learning CNN models across five real-world image datasets. The datasets span binary classification, fine-grained multiclass recognition, and object detection scenarios. We analyze how architectural factors, such as network depth, residual connections, and feature extraction strategies, influence classification and localization performance. The results show that deeper CNN architectures provide substantial performance gains on fine-grained multiclass datasets, while lightweight pretrained and transfer learning models remain highly effective for simpler binary classification tasks. Additionally, we extend the proposed architecture to an object detection setting, demonstrating its adaptability in identifying unauthorized auto-rickshaws in real-world traffic scenes. Building upon a systematic analysis of custom CNN architectures alongside pretrained and transfer learning models, this study provides practical guidance for selecting suitable network designs based on task complexity and resource constraints.
We propose a maturity-based framework for certifying embodied AI systems through explicit measurement mechanisms. We argue that certifiable embodied AI requires structured assessment frameworks, quantitative scoring mechanisms, and methods for navigating multi-objective trade-offs inherent in trustworthiness evaluation. We demonstrate this approach using uncertainty quantification as an exemplar measurement mechanism and illustrate feasibility through an Uncrewed Aircraft System (UAS) detection case study.
Automated analysis of volumetric medical imaging on edge devices is severely constrained by the high memory and computational demands of 3D Convolutional Neural Networks (CNNs). This paper develops a lightweight computer vision framework that reconciles the efficiency of 2D detection with the necessity of 3D context by reformulating volumetric Computer Tomography (CT) data as sequential video streams. This video-viewpoint paradigm is applied to the time-sensitive task of Intracranial Hemorrhage (ICH) detection using the Hemorica dataset. To ensure operational efficiency, we benchmarked multiple generations of the YOLO architecture (v8, v10, v11 and v12) in their Nano configurations, selecting the version with the highest mAP@50 to serve as the slice-level backbone. A ByteTrack algorithm is then introduced to enforce anatomical consistency across the $z$-axis. To address the initialization lag inherent in video trackers, a hybrid inference strategy and a spatiotemporal consistency filter are proposed to distinguish true pathology from transient prediction noise. Experimental results on independent test data demonstrate that the proposed framework serves as a rigorous temporal validator, increasing detection Precision from 0.703 to 0.779 compared to the baseline 2D detector, while maintaining high sensitivity. By approximating 3D contextual reasoning at a fraction of the computational cost, this method provides a scalable solution for real-time patient prioritization in resource-constrained environments, such as mobile stroke units and IoT-enabled remote clinics.
Multimodal large language models (MLLMs) demonstrate exceptional capabilities in semantic understanding and visual reasoning, yet they still face challenges in precise object localization and resource-constrained edge-cloud deployment. To address this, this paper proposes the AIVD framework, which achieves unified precise localization and high-quality semantic generation through the collaboration between lightweight edge detectors and cloud-based MLLMs. To enhance the cloud MLLM's robustness against edge cropped-box noise and scenario variations, we design an efficient fine-tuning strategy with visual-semantic collaborative augmentation, significantly improving classification accuracy and semantic consistency. Furthermore, to maintain high throughput and low latency across heterogeneous edge devices and dynamic network conditions, we propose a heterogeneous resource-aware dynamic scheduling algorithm. Experimental results demonstrate that AIVD substantially reduces resource consumption while improving MLLM classification performance and semantic generation quality. The proposed scheduling strategy also achieves higher throughput and lower latency across diverse scenarios.
Roadside perception datasets are typically constructed via cooperative labeling between synchronized vehicle and roadside frame pairs. However, real deployment often requires annotation of roadside-only data due to hardware and privacy constraints. Even human experts struggle to produce accurate labels without vehicle-side data (image, LIDAR), which not only increases annotation difficulty and cost, but also reveals a fundamental learnability problem: many roadside-only scenes contain distant, blurred, or occluded objects whose 3D properties are ambiguous from a single view and can only be reliably annotated by cross-checking paired vehicle--roadside frames. We refer to such cases as inherently ambiguous samples. To reduce wasted annotation effort on inherently ambiguous samples while still obtaining high-performing models, we turn to active learning. This work focuses on active learning for roadside monocular 3D object detection and proposes a learnability-driven framework that selects scenes which are both informative and reliably labelable, suppressing inherently ambiguous samples while ensuring coverage. Experiments demonstrate that our method, LH3D, achieves 86.06%, 67.32%, and 78.67% of full-performance for vehicles, pedestrians, and cyclists respectively, using only 25% of the annotation budget on DAIR-V2X-I, significantly outperforming uncertainty-based baselines. This confirms that learnability, not uncertainty, matters for roadside 3D perception.
The primary objective of a diverse planning approach is to generate a set of plans that are distinct from one another. Such an approach is applied in a variety of real-world domains, including risk management, automated stream data analysis, and malware detection. More recently, a novel diverse planning paradigm, referred to as behaviour planning, has been proposed. This approach extends earlier methods by explicitly incorporating a diversity model into the planning process and supporting multiple planning categories. In this paper, we demonstrate the usefulness of behaviour planning in real-world settings by presenting three case studies. The first case study focuses on storytelling, the second addresses urban planning, and the third examines game evaluation.