Object detection is a computer vision task whose goal is to detect and locate objects of interest in an image or video. The task involves identifying the position and boundaries of each object and classifying it into one of a set of categories. It forms a crucial part of visual recognition, alongside image classification and image retrieval.
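To make the task concrete, the following minimal Python sketch (purely illustrative; the class and field names are our own) shows the kind of output an object detector produces: a bounding box locating each object plus a category label and a confidence score.

```python
# Illustrative only: a typical detection result pairs a bounding box
# with a class label and a confidence score.
from dataclasses import dataclass

@dataclass
class Detection:
    x_min: float   # left edge of the bounding box, in pixels
    y_min: float   # top edge
    x_max: float   # right edge
    y_max: float   # bottom edge
    label: str     # predicted object category
    score: float   # classifier confidence in [0, 1]

# Example: a detector might report a car and a pedestrian in one frame.
detections = [
    Detection(34.0, 120.5, 210.0, 300.2, "car", 0.91),
    Detection(400.1, 80.0, 455.7, 250.3, "pedestrian", 0.76),
]
for d in detections:
    print(f"{d.label}: box=({d.x_min:.0f},{d.y_min:.0f},{d.x_max:.0f},{d.y_max:.0f}), score={d.score:.2f}")
```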
Key doctrines, including novelty (patent), originality (copyright), and distinctiveness (trademark), turn on a shared empirical question: whether a body of work is meaningfully distinct from a relevant reference class. Yet analyses typically operationalize this set-level inquiry using item-level evidence: pairwise comparisons among exemplars. That unit-of-analysis mismatch may be manageable for finite corpora of human-created works, where it can be bridged by ad hoc aggregations. But it becomes acute for machine-generated works, where the object of evaluation is not a fixed set of works but a generative process with an effectively unbounded output space. We propose a distributional alternative: a two-sample test based on maximum mean discrepancy (MMD) computed on semantic embeddings to determine whether two creative processes, human or machine, produce statistically distinguishable output distributions. The test requires no task-specific training, obviating the need for discovery of proprietary training data to characterize the generative process, and is sample-efficient, often detecting differences with as few as 5-10 images and 7-20 texts. We validate the framework across three domains: handwritten digits (controlled images), patent abstracts (text), and AI-generated art (real-world images). We reveal a perceptual paradox: even when human evaluators distinguish AI outputs from human-created art with only about 58% accuracy, our method detects distributional distinctiveness. Our results present evidence contrary to the view that generative models act as mere regurgitators of training data. Rather than producing outputs statistically indistinguishable from a human baseline, as simple regurgitation would predict, they produce outputs that are semantically human-like yet stochastically distinct, suggesting their dominant function is as a semantic interpolator within a learned latent space.
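A minimal sketch of the kind of test described, assuming the works have already been mapped to embedding vectors stored as NumPy arrays; the function names, the RBF kernel, and the median-heuristic bandwidth are our choices, not necessarily the paper's exact configuration.

```python
# Kernel two-sample test sketch: unbiased MMD^2 with an RBF kernel plus a
# permutation test, applied to two sets of semantic embedding vectors.
import numpy as np

def _rbf_kernel(A, B, gamma):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd2_unbiased(X, Y, gamma):
    Kxx = _rbf_kernel(X, X, gamma)
    Kyy = _rbf_kernel(Y, Y, gamma)
    Kxy = _rbf_kernel(X, Y, gamma)
    m, n = len(X), len(Y)
    np.fill_diagonal(Kxx, 0.0)           # unbiased estimator drops i == j terms
    np.fill_diagonal(Kyy, 0.0)
    return Kxx.sum() / (m * (m - 1)) + Kyy.sum() / (n * (n - 1)) - 2 * Kxy.mean()

def mmd_permutation_test(X, Y, n_perm=1000, seed=0):
    rng = np.random.default_rng(seed)
    Z = np.vstack([X, Y])
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    gamma = 1.0 / (2 * np.median(d2[d2 > 0]))          # median heuristic bandwidth
    observed = mmd2_unbiased(X, Y, gamma)
    count = 0
    for _ in range(n_perm):                            # shuffle group labels
        perm = rng.permutation(len(Z))
        Xp, Yp = Z[perm[:len(X)]], Z[perm[len(X):]]
        if mmd2_unbiased(Xp, Yp, gamma) >= observed:
            count += 1
    return observed, (count + 1) / (n_perm + 1)        # statistic, p-value

# Usage with hypothetical embedding matrices (rows = works, columns = features):
X = np.random.default_rng(1).normal(0.0, 1.0, size=(10, 64))   # e.g., human works
Y = np.random.default_rng(2).normal(0.3, 1.0, size=(10, 64))   # e.g., machine works
mmd2, p = mmd_permutation_test(X, Y)
print(f"MMD^2 = {mmd2:.4f}, permutation p-value = {p:.3f}")
```

A small p-value indicates that the two output distributions are statistically distinguishable at the chosen sample sizes.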
In this paper, we introduce CD-TWINSAFE, a V2I-based digital twin for autonomous vehicles. The proposed architecture is composed of two stacks running simultaneously: an on-board driving stack that includes a stereo camera for scene understanding, and a digital twin stack that runs an Unreal Engine 5 replica of the scene viewed by the camera and returns safety alerts to the cockpit. The on-board stack is implemented on the vehicle side and includes two main autonomy modules: localization and perception. The position and orientation of the ego vehicle are obtained using on-board sensors. The perception module processes 20-fps images from the stereo camera and understands the scene through two complementary pipelines, which perform object detection and feature extraction, including object velocity, yaw, and the safety metrics time-to-collision and time-headway. The data collected from the driving stack are sent to the infrastructure side through the ROS-enabled architecture as custom ROS2 messages transmitted over UDP links carried by a 4G modem for V2I communication. The environment is monitored via the digital twin through these shared messages, which update the spawned ego vehicle and detected objects based on real-time localization and perception data. Several tests with different driving scenarios confirm the validity and real-time response of the proposed architecture.
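For concreteness, a minimal sketch (our own, under a constant-relative-velocity assumption, not the paper's implementation) of the two safety metrics the perception pipelines are described as computing for each detected object.

```python
# Safety-metric sketch: time-to-collision and time-headway per detected object.
def time_to_collision(range_m: float, closing_speed_mps: float) -> float:
    """Time until collision if the current closing speed stays constant.
    Returns infinity when the gap is not closing."""
    if closing_speed_mps <= 0.0:
        return float("inf")
    return range_m / closing_speed_mps

def time_headway(range_m: float, ego_speed_mps: float) -> float:
    """Time for the ego vehicle to cover the current gap at its own speed."""
    if ego_speed_mps <= 0.0:
        return float("inf")
    return range_m / ego_speed_mps

# Example: a lead vehicle 25 m ahead, ego at 20 m/s, closing at 5 m/s.
print(time_to_collision(25.0, 5.0))   # 5.0 s
print(time_headway(25.0, 20.0))       # 1.25 s
```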
The rapid proliferation of airborne platforms, including commercial aircraft, drones, and UAVs, has intensified the need for real-time, automated threat assessment systems. Current approaches depend heavily on manual monitoring, resulting in limited scalability and operational inefficiencies. This work introduces a dual-task model based on EfficientNetB4 capable of performing airborne object classification and threat-level prediction simultaneously. To address the scarcity of clean, balanced training data, we constructed the AODTA Dataset by aggregating and refining multiple public sources. We benchmarked our approach on both the AVD Dataset and the newly developed AODTA Dataset and further compared performance against a ResNet-50 baseline, which consistently underperformed EfficientNetB4. Our EfficientNetB4 model achieved 96% accuracy in object classification and 90% accuracy in threat-level prediction, underscoring its promise for applications in surveillance, defense, and airspace management. Although the title references detection, this study focuses specifically on classification and threat-level inference using pre-localized airborne object images provided by existing datasets.
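As an illustration of the dual-task setup, a hedged Keras sketch follows: a shared EfficientNetB4 backbone feeding two classification heads, one for the airborne-object category and one for the threat level. The class counts, input resolution, dropout rate, and loss choices are our placeholders, not the paper's reported configuration.

```python
# Dual-head EfficientNetB4 sketch: shared backbone, two softmax outputs.
import tensorflow as tf

NUM_CLASSES = 5        # placeholder: number of airborne object categories
NUM_THREAT_LEVELS = 3  # placeholder: number of threat levels

backbone = tf.keras.applications.EfficientNetB4(
    include_top=False, weights="imagenet", input_shape=(380, 380, 3))
x = tf.keras.layers.GlobalAveragePooling2D()(backbone.output)
x = tf.keras.layers.Dropout(0.3)(x)
class_head = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax",
                                   name="object_class")(x)
threat_head = tf.keras.layers.Dense(NUM_THREAT_LEVELS, activation="softmax",
                                    name="threat_level")(x)

model = tf.keras.Model(inputs=backbone.input, outputs=[class_head, threat_head])
model.compile(
    optimizer="adam",
    loss={"object_class": "sparse_categorical_crossentropy",
          "threat_level": "sparse_categorical_crossentropy"},
    metrics={"object_class": "accuracy", "threat_level": "accuracy"},
)
model.summary()
```

Training such a model requires batches that pair each image with both an object-class label and a threat-level label.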
Background: In medical and health professions education (HPE), AI is increasingly used to assess clinical competencies, including via virtual standardized patients. However, most evaluations rely on AI-human interrater reliability and lack a measurement framework for how cases, learners, and raters jointly shape scores. This leaves robustness uncertain and can expose learners to misguidance from unvalidated systems. We address this by using AI "simulated learners" to stress-test and psychometrically characterize assessment pipelines before human use. Objective: Develop an open-source AI virtual patient platform and measurement model for robust competency evaluation across cases and rating conditions. Methods: We built a platform with virtual patients, virtual learners with tunable ACGME-aligned competency profiles, and multiple independent AI raters scoring encounters with structured Key-Features items. Transcripts were analyzed with a Bayesian HRM-SDT model that treats ratings as decisions under uncertainty and separates learner ability, case performance, and rater behavior; parameters were estimated with MCMC. Results: The model recovered simulated learners' competencies, with significant correlations to the generating competencies across all ACGME domains despite a non-deterministic pipeline. It estimated case difficulty by competency and showed stable rater detection (sensitivity) and criteria (severity/leniency thresholds) across AI raters using identical models/prompts but different seeds. We also propose a staged "safety blueprint" for deploying AI tools with learners, tied to entrustment-based validation milestones. Conclusions: Combining a purpose-built virtual patient platform with a principled psychometric model enables robust, interpretable, generalizable competency estimates and supports validation of AI-assisted assessment prior to use with human learners.
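To illustrate the signal-detection view of rating that the measurement model inverts, here is a deliberately simplified generative sketch (ours, not the paper's HRM-SDT specification): a latent performance signal combines learner ability and case difficulty, and each rater converts a noisy read of that signal into an ordinal score via rater-specific thresholds. All numeric values are placeholders.

```python
# Simplified generative sketch of SDT-style rating of an encounter.
import numpy as np

rng = np.random.default_rng(0)

ability = 0.8            # latent learner competency (placeholder)
case_difficulty = -0.3   # harder cases shift the performance signal down
rater_sensitivity = 1.5  # how sharply the rater resolves the signal (d'-like)
rater_thresholds = np.array([-0.5, 0.5, 1.5])  # criteria separating 4 score levels

def simulate_rating():
    latent = ability - case_difficulty                      # true performance signal
    perceived = rater_sensitivity * latent + rng.normal()   # noisy rater read
    return int(np.searchsorted(rater_thresholds, perceived))  # ordinal score 0-3

scores = [simulate_rating() for _ in range(10)]
print(scores)
```

The Bayesian model described above works in the opposite direction, estimating ability, case difficulty, rater sensitivity, and thresholds from observed ratings via MCMC.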
The study investigates the juridico-technological architecture of international public health instruments, focusing on their implementation across India, the European Union, the United States and low- and middle-income countries (LMICs), particularly in Sub-Saharan Africa. It addresses a research lacuna: the insufficient harmonisation between normative health law and algorithmic public health infrastructures in resource-constrained jurisdictions. The principal objective is to assess how artificial intelligence augments implementation of instruments grounded in IHR 2005 and the WHO FCTC while identifying doctrinal and infrastructural bottlenecks. Using comparative doctrinal analysis and legal-normative mapping, the study triangulates legislative instruments, WHO monitoring frameworks, AI systems including BlueDot, Aarogya Setu and EIOS, and compliance metrics. Preliminary results show that AI has improved early detection, surveillance precision and responsiveness in high-capacity jurisdictions, whereas LMICs face infrastructural deficits, data privacy gaps and fragmented legal scaffolding. The findings highlight the relevance of the EU Artificial Intelligence Act and GDPR as regulatory prototypes for health-oriented algorithmic governance and contrast them with embryonic AI integration and limited internet penetration in many LMICs. The study argues for embedding AI within a rights-compliant, supranationally coordinated regulatory framework to secure equitable health outcomes and stronger compliance. It proposes a model for algorithmic treaty-making inspired by FCTC architecture and calls for WHO-led compliance mechanisms modelled on the WTO Dispute Settlement Body to enhance pandemic preparedness, surveillance equity and transnational governance resilience.
The increasing availability of high-resolution satellite imagery, together with advances in deep learning, creates new opportunities for enhancing forest monitoring workflows. Two central challenges in this domain are pixel-level change detection and semantic change interpretation, particularly for complex forest dynamics. While large language models (LLMs) are increasingly adopted for data exploration, their integration with vision-language models (VLMs) for remote sensing image change interpretation (RSICI) remains underexplored, especially beyond urban environments. We introduce Forest-Chat, an LLM-driven agent designed for integrated forest change analysis. The proposed framework enables natural language querying and supports multiple RSICI tasks, including change detection, change captioning, object counting, deforestation percentage estimation, and change reasoning. Forest-Chat builds upon a multi-level change interpretation (MCI) vision-language backbone with LLM-based orchestration, and incorporates zero-shot change detection via a foundation change detection model together with an interactive point-prompt interface to support fine-grained user guidance. To facilitate adaptation and evaluation in forest environments, we introduce the Forest-Change dataset, comprising bi-temporal satellite imagery, pixel-level change masks, and multi-granularity semantic change captions generated through a combination of human annotation and rule-based methods. Experimental results demonstrate that Forest-Chat achieves strong performance on Forest-Change and on LEVIR-MCI-Trees, a tree-focused subset of LEVIR-MCI, for joint change detection and captioning, highlighting the potential of interactive, LLM-driven RSICI systems to improve accessibility, interpretability, and analytical efficiency in forest change analysis.
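A hedged Python sketch of the general pattern of LLM-based tool orchestration described above; the tool names, their signatures, and the `ask_llm` callable are placeholders of ours, not the Forest-Chat API.

```python
# Sketch: an LLM routes a natural-language query to one RSICI tool.
from typing import Callable, Dict

def detect_change(img_t1, img_t2):          # placeholder: pixel-level change mask
    ...

def caption_change(img_t1, img_t2):         # placeholder: semantic change caption
    ...

def estimate_deforestation(img_t1, img_t2): # placeholder: % of forest pixels lost
    ...

TOOLS: Dict[str, Callable] = {
    "change_detection": detect_change,
    "change_captioning": caption_change,
    "deforestation_percentage": estimate_deforestation,
}

def route_query(query: str, img_t1, img_t2, ask_llm: Callable[[str], str]):
    """Ask the LLM which tool fits the query, then run it on the image pair."""
    prompt = (
        "Choose exactly one tool for the query below.\n"
        f"Tools: {', '.join(TOOLS)}\nQuery: {query}\n"
        "Answer with the tool name only."
    )
    tool_name = ask_llm(prompt).strip()
    tool = TOOLS.get(tool_name, detect_change)   # fall back to change detection
    return tool(img_t1, img_t2)
```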
Unsupervised domain adaptation for object detection addresses the adaptation of detectors trained in a source domain so that they work accurately in an unseen target domain. Recently, methods that align intermediate features have proven promising, achieving state-of-the-art results. However, these methods are laborious to implement and hard to interpret, and there is still room for improvement in closing the performance gap toward the upper bound (training directly on target-domain data). In this work, we propose a method to generate an artificial dataset in the target domain and use it to train an object detector. We employ two unsupervised image translators (CycleGAN and an AdaIN-based model) using only annotated data from the source domain and non-annotated data from the target domain. Our key contribution is a less complex yet more effective method with improved interpretability. Results on real-world scenarios for autonomous driving show significant improvements, outperforming state-of-the-art methods in most cases and further closing the gap toward the upper bound.
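A simplified sketch of the core idea, not the authors' pipeline: translate annotated source-domain images into the target style with an unsupervised image translator, keep the original box annotations (since the translation preserves geometry), and train any standard detector on the result. The `translate` callable stands in for a trained CycleGAN or AdaIN-based generator.

```python
# Build an artificial target-domain training set from annotated source images.
from pathlib import Path
from PIL import Image

def build_artificial_dataset(source_images, source_annotations, translate, out_dir):
    """Translate each source image to the target domain; box labels are reused
    unchanged because the translator preserves object geometry."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    pairs = []
    for img_path, boxes in zip(source_images, source_annotations):
        img = Image.open(img_path).convert("RGB")
        fake_target = translate(img)                  # e.g., clear-day -> foggy/night
        out_path = out_dir / Path(img_path).name
        fake_target.save(out_path)
        pairs.append((out_path, boxes))               # same boxes, new appearance
    return pairs  # feed these pairs to a standard detector training loop
```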
Hateful videos pose serious risks by amplifying discrimination, inciting violence, and undermining online safety. Existing training-based hateful video detection methods are constrained by limited training data and a lack of interpretability, while directly prompting large vision-language models often struggles to deliver reliable hate detection. To address these challenges, this paper introduces MARS, a training-free Multi-stage Adversarial ReaSoning framework that enables reliable and interpretable hateful content detection. MARS begins with an objective description of the video content, establishing a neutral foundation for subsequent analysis. Building on this, it develops evidence-based reasoning that supports potential hateful interpretations, while in parallel incorporating counter-evidence reasoning to capture plausible non-hateful perspectives. Finally, these perspectives are synthesized into a conclusive and explainable decision. Extensive evaluation on two real-world datasets shows that MARS achieves up to a 10% improvement under certain backbones and settings compared to other training-free approaches and outperforms state-of-the-art training-based methods on one dataset. In addition, MARS produces human-understandable justifications, thereby supporting compliance oversight and enhancing the transparency of content moderation workflows. The code is available at https://github.com/Multimodal-Intelligence-Lab-MIL/MARS.
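A minimal sketch of the multi-stage prompting flow described above (ours, not the released MARS code); `chat` is a placeholder for any vision-language model call that accepts video frames plus a text prompt and returns text, and the prompt wording is illustrative.

```python
# Four-stage adversarial reasoning sketch: describe, argue for, argue against, decide.
def mars_pipeline(frames, chat):
    description = chat(frames, "Describe the video content objectively, without judgment.")
    pro = chat(frames, "Given this description, list evidence that the video could be "
                       f"hateful:\n{description}")
    con = chat(frames, "Given the same description, list evidence supporting a "
                       f"non-hateful interpretation:\n{description}")
    verdict = chat(frames, "Weigh both analyses and answer 'hateful' or 'non-hateful', "
                           f"with a short justification.\nFor: {pro}\nAgainst: {con}")
    return verdict
```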
The growing number of differently-abled and elderly individuals demands affordable, intelligent wheelchairs that combine safe navigation with health monitoring. Traditional wheelchairs lack dynamic features, and many smart alternatives remain costly, single-modality, and limited in health integration. Motivated by the pressing demand for advanced, personalized, and affordable assistive technologies, we propose a comprehensive AI-IoT based smart wheelchair system that incorporates glove-based gesture control for hands-free navigation, real-time object detection using YOLOv8 with auditory feedback for obstacle avoidance, and ultrasonic sensing for immediate collision avoidance. Vital signs (heart rate, SpO$_2$, ECG, temperature) are continuously monitored, uploaded to ThingSpeak, and trigger email alerts for critical conditions. Built on a modular and low-cost architecture, the gesture control achieved a 95.5\% success rate, ultrasonic obstacle detection reached 94\% accuracy, and YOLOv8-based object detection delivered 91.5\% precision, 90.2\% recall, and a 90.8\% F1-score. This integrated, multi-modal approach offers a practical, scalable, and affordable solution, significantly enhancing user autonomy, safety, and independence by bridging the gap between innovative research and real-world deployment.
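A hedged sketch of the vision component only (not the deployed system): run a pretrained YOLOv8 model on a camera frame and announce detected obstacles. The checkpoint name, camera index, and confidence threshold are assumptions, and the announcement is printed here rather than routed to a text-to-speech engine.

```python
# YOLOv8 obstacle-announcement sketch using the ultralytics package and OpenCV.
import cv2
from ultralytics import YOLO

model = YOLO("yolov8n.pt")          # lightweight pretrained checkpoint (assumption)
cap = cv2.VideoCapture(0)           # on-board camera index is an assumption

ret, frame = cap.read()
if ret:
    result = model(frame)[0]
    for box in result.boxes:
        label = result.names[int(box.cls)]
        conf = float(box.conf)
        if conf > 0.5:              # placeholder confidence threshold
            print(f"Obstacle ahead: {label} ({conf:.0%} confidence)")
cap.release()
```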
Achieving pronunciation proficiency in a second language (L2) remains a challenge, despite the development of Computer-Assisted Pronunciation Training (CAPT) systems. Traditional CAPT systems often provide unintuitive feedback that lacks actionable guidance, limiting their effectiveness. Recent advancements in audio-language models (ALMs) offer the potential to enhance these systems by providing more user-friendly feedback. In this work, we investigate ALMs for chat-based pronunciation training by introducing L2-Arctic-plus, an English dataset with detailed error explanations and actionable suggestions for improvement. We benchmark cascaded ASR+LLM systems and existing ALMs on this dataset, specifically on detecting mispronunciations and generating actionable feedback. To improve performance, we further propose to instruction-tune ALMs on L2-Arctic-plus. Experimental results demonstrate that our instruction-tuned models significantly outperform existing baselines on mispronunciation detection and suggestion generation in terms of both objective and human evaluation, highlighting the value of the proposed dataset.
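For illustration, a small sketch of how an annotated learner utterance could be turned into an instruction-tuning example pairing the audio with an error explanation and an actionable suggestion. The field names and file path are hypothetical, not the L2-Arctic-plus schema.

```python
# Build one instruction-tuning example from a hypothetical annotation record.
import json

annotation = {
    "audio_path": "speaker01/utt_042.wav",
    "word": "think",
    "error": "substituted /s/ for /th/",
    "suggestion": "Place the tongue tip lightly between the teeth and release air gently.",
}

example = {
    "audio": annotation["audio_path"],
    "instruction": "Listen to the learner's pronunciation of the word "
                   f"'{annotation['word']}'. Identify any mispronunciation and give "
                   "actionable advice.",
    "response": f"The learner {annotation['error']} in '{annotation['word']}'. "
                f"Suggestion: {annotation['suggestion']}",
}
print(json.dumps(example, indent=2))
```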