Object detection is a computer vision task whose goal is to detect and localize objects of interest in an image or video. It involves identifying the position and boundaries of each object and classifying it into one of a set of categories. Together with image classification and retrieval, it forms a core part of visual recognition.
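As a minimal illustration of the task's output format (bounding boxes, class labels, confidence scores), the sketch below runs a pretrained torchvision detector on a dummy image; the model choice and the 0.5 score threshold are illustrative assumptions unrelated to any method described in this collection.

```python
# Minimal object-detection sketch (illustrative only).
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)           # stand-in for a real RGB image in [0, 1]
with torch.no_grad():
    prediction = model([image])[0]        # dict with boxes, labels, scores

keep = prediction["scores"] > 0.5         # keep confident detections only
for box, label, score in zip(prediction["boxes"][keep],
                             prediction["labels"][keep],
                             prediction["scores"][keep]):
    print(f"class={label.item()} score={score.item():.2f} box={box.tolist()}")
```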
Hateful videos pose serious risks by amplifying discrimination, inciting violence, and undermining online safety. Existing training-based hateful video detection methods are constrained by limited training data and a lack of interpretability, while directly prompting large vision-language models often fails to deliver reliable hate detection. To address these challenges, this paper introduces MARS, a training-free Multi-stage Adversarial ReaSoning framework that enables reliable and interpretable hateful content detection. MARS begins with an objective description of the video content, establishing a neutral foundation for subsequent analysis. Building on this, it develops evidence-based reasoning that supports potential hateful interpretations, while in parallel incorporating counter-evidence reasoning to capture plausible non-hateful perspectives. Finally, these perspectives are synthesized into a conclusive and explainable decision. Extensive evaluation on two real-world datasets shows that MARS achieves up to a 10% improvement over other training-free approaches under certain backbones and settings, and outperforms state-of-the-art training-based methods on one dataset. In addition, MARS produces human-understandable justifications, thereby supporting compliance oversight and enhancing the transparency of content moderation workflows. The code is available at https://github.com/Multimodal-Intelligence-Lab-MIL/MARS.
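A minimal sketch of the staged reasoning flow described above, assuming a generic `query_vlm(prompt, video_path)` wrapper around a vision-language model; the wrapper and the prompt wording are hypothetical illustrations, not MARS's actual prompts or code.

```python
# Hedged sketch of a multi-stage adversarial-reasoning pipeline over a VLM.
def query_vlm(prompt: str, video_path: str) -> str:
    raise NotImplementedError("plug in a vision-language model backend here")

def classify_video(video_path: str) -> str:
    # Stage 1: neutral, objective description of the video content.
    description = query_vlm("Describe the video objectively, without judgement.", video_path)

    # Stage 2a: evidence-based reasoning supporting a hateful interpretation.
    pro = query_vlm(f"Given this description:\n{description}\n"
                    "List evidence that could support a hateful interpretation.", video_path)

    # Stage 2b: counter-evidence reasoning for plausible non-hateful readings.
    con = query_vlm(f"Given this description:\n{description}\n"
                    "List evidence supporting a non-hateful interpretation.", video_path)

    # Stage 3: synthesize both perspectives into an explainable decision.
    return query_vlm("Weigh the two analyses below and answer 'hateful' or 'non-hateful' "
                     f"with a short justification.\nSupporting: {pro}\nCounter: {con}",
                     video_path)
```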
Achieving pronunciation proficiency in a second language (L2) remains a challenge, despite the development of Computer-Assisted Pronunciation Training (CAPT) systems. Traditional CAPT systems often provide unintuitive feedback that lacks actionable guidance, limiting their effectiveness. Recent advances in audio-language models (ALMs) offer the potential to enhance these systems with more user-friendly feedback. In this work, we investigate ALMs for chat-based pronunciation training by introducing L2-Arctic-plus, an English dataset with detailed error explanations and actionable suggestions for improvement. We benchmark cascaded ASR+LLM pipelines and existing ALMs on this dataset, specifically on detecting mispronunciations and generating actionable feedback. To improve performance, we further propose to instruction-tune ALMs on L2-Arctic-plus. Experimental results demonstrate that our instruction-tuned models significantly outperform existing baselines on mispronunciation detection and suggestion generation in both objective and human evaluation, highlighting the value of the proposed dataset.
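To make the cascaded ASR+LLM baseline concrete, here is a minimal sketch under stated assumptions: `recognize_phonemes` and `query_llm` are hypothetical placeholders, the simple pairwise alignment stands in for a proper edit-distance alignment, and the prompt wording is illustrative rather than the paper's baseline.

```python
# Hedged sketch of a cascaded ASR+LLM pronunciation-feedback baseline.
def recognize_phonemes(audio_path: str) -> list[str]:
    raise NotImplementedError("plug in a phoneme-level ASR model here")

def query_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM backend here")

def pronunciation_feedback(audio_path: str, reference_phonemes: list[str]) -> str:
    produced = recognize_phonemes(audio_path)
    # A real system would align produced vs. reference phonemes with edit distance;
    # a simple positional comparison is shown here for brevity.
    errors = [(ref, hyp) for ref, hyp in zip(reference_phonemes, produced) if ref != hyp]
    prompt = ("The learner intended the phonemes "
              f"{' '.join(reference_phonemes)} but produced {' '.join(produced)}. "
              f"Mismatches: {errors}. Explain each error and give one actionable "
              "articulation tip per error, in plain language.")
    return query_llm(prompt)
```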
The rapid proliferation of airborne platforms, including commercial aircraft, drones, and UAVs, has intensified the need for real-time, automated threat assessment systems. Current approaches depend heavily on manual monitoring, resulting in limited scalability and operational inefficiencies. This work introduces a dual-task model based on EfficientNetB4 that performs airborne object classification and threat-level prediction simultaneously. To address the scarcity of clean, balanced training data, we constructed the AODTA Dataset by aggregating and refining multiple public sources. We benchmarked our approach on both the AVD Dataset and the newly developed AODTA Dataset, and further compared performance against a ResNet-50 baseline, which consistently underperformed EfficientNetB4. Our EfficientNetB4 model achieved 96% accuracy in object classification and 90% accuracy in threat-level prediction, underscoring its promise for applications in surveillance, defense, and airspace management. Although the title references detection, this study focuses specifically on classification and threat-level inference using pre-localized airborne object images provided by existing datasets.
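A minimal sketch of the shared-backbone, dual-head design described above; the head sizes, input resolution, and the use of torchvision's EfficientNet-B4 implementation are assumptions for illustration, not the paper's exact architecture.

```python
# Hedged sketch of a dual-task classifier: one shared backbone, two output heads.
import torch
import torch.nn as nn
from torchvision.models import efficientnet_b4

class DualTaskNet(nn.Module):
    def __init__(self, num_object_classes: int = 10, num_threat_levels: int = 3):
        super().__init__()
        backbone = efficientnet_b4(weights=None)       # shared feature extractor
        self.features = backbone.features
        self.pool = nn.AdaptiveAvgPool2d(1)
        feat_dim = backbone.classifier[1].in_features  # 1792 for EfficientNet-B4
        self.object_head = nn.Linear(feat_dim, num_object_classes)   # what is it?
        self.threat_head = nn.Linear(feat_dim, num_threat_levels)    # how dangerous?

    def forward(self, x):
        f = self.pool(self.features(x)).flatten(1)
        return self.object_head(f), self.threat_head(f)

model = DualTaskNet()
logits_obj, logits_threat = model(torch.rand(2, 3, 380, 380))
print(logits_obj.shape, logits_threat.shape)           # (2, 10) and (2, 3)
```

In training, each head would receive its own cross-entropy loss and the two losses would be summed so the backbone learns both tasks jointly.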
This work focuses on national-scale land-use/land-cover (LULC) semantic segmentation using ALOS-2 single-polarization (HH) SAR data over Japan, together with a companion binary water detection task. Building on SAR-W-MixMAE self-supervised pretraining [1], we address common SAR dense-prediction failure modes (boundary over-smoothing, missed thin/slender structures, and rare-class degradation under long-tailed labels) without increasing pipeline complexity. We introduce three lightweight refinements: (i) injecting high-resolution features into multi-scale decoding, (ii) a progressive refine-up head that alternates convolutional refinement and stepwise upsampling, and (iii) an $\alpha$-scale factor that tempers class reweighting within a focal+dice objective. The resulting model yields consistent improvements on the Japan-wide ALOS-2 LULC benchmark, particularly for under-represented classes, and improves water detection across standard evaluation metrics.
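The sketch below shows one plausible reading of refinement (iii): inverse-frequency class weights raised to a power $\alpha < 1$ before being applied inside a combined focal+dice loss. The exact tempering used in the paper may differ; shapes, hyperparameters, and the specific focal/dice formulations are assumptions.

```python
# Hedged sketch of an alpha-tempered class reweighting inside a focal+dice objective.
import torch
import torch.nn.functional as F

def tempered_weights(class_freq: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    w = 1.0 / class_freq.clamp(min=1)     # inverse-frequency weights for long-tailed labels
    w = w ** alpha                        # alpha < 1 tempers (softens) the reweighting
    return w / w.mean()                   # keep weights centered around 1

def focal_dice_loss(logits, target, class_weights, gamma: float = 2.0, eps: float = 1e-6):
    # logits: (B, C, H, W); target: (B, H, W) integer class map
    ce = F.cross_entropy(logits, target, reduction="none")   # unweighted CE per pixel
    pt = torch.exp(-ce)                                       # prob. of the true class
    w = class_weights[target]                                 # tempered per-pixel weight
    focal = (w * (1 - pt) ** gamma * ce).mean()

    probs = logits.softmax(dim=1)
    onehot = F.one_hot(target, logits.shape[1]).permute(0, 3, 1, 2).float()
    inter = (probs * onehot).sum(dim=(0, 2, 3))
    denom = probs.sum(dim=(0, 2, 3)) + onehot.sum(dim=(0, 2, 3))
    dice = 1 - ((2 * inter + eps) / (denom + eps)).mean()
    return focal + dice
```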
Unsupervised domain adaptation for object detection addresses adapting detectors trained on a source domain to work accurately in an unseen target domain. Recently, methods that align intermediate features have proven promising, achieving state-of-the-art results. However, these methods are laborious to implement and hard to interpret, and, although promising, still leave a performance gap toward the upper bound (training directly on target-domain data). In this work, we propose a method that generates an artificial dataset in the target domain to train an object detector. We employ two unsupervised image translators (CycleGAN and an AdaIN-based model) using only annotated data from the source domain and non-annotated data from the target domain. Our key contribution is a less complex yet more effective method with improved interpretability. Results on real-world scenarios for autonomous driving show significant improvements, outperforming state-of-the-art methods in most cases and further closing the gap toward the upper bound.
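A minimal sketch of the artificial-dataset construction step, assuming a translator callable (e.g., an already-trained CycleGAN or AdaIN-based generator) is available; the helper names and dataset layout are illustrative, not the paper's code. Because image-to-image translation preserves object geometry, the source annotations can be reused unchanged for the translated images.

```python
# Hedged sketch: build a pseudo-target-domain detection dataset from annotated source data.
import os
from PIL import Image

def build_artificial_dataset(source_items, translator, out_dir):
    """source_items: iterable of (image_path, annotation) pairs from the source domain."""
    os.makedirs(out_dir, exist_ok=True)
    artificial = []
    for image_path, annotation in source_items:
        image = Image.open(image_path).convert("RGB")
        fake_target = translator(image)                    # source -> target appearance
        out_path = os.path.join(out_dir, os.path.basename(image_path))
        fake_target.save(out_path)
        artificial.append((out_path, annotation))          # boxes/labels carry over as-is
    return artificial                                      # used to train the detector
```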
The growing number of differently-abled and elderly individuals demands affordable, intelligent wheelchairs that combine safe navigation with health monitoring. Traditional wheelchairs lack dynamic features, and many smart alternatives remain costly, single-modality, and limited in health integration. Motivated by the pressing demand for advanced, personalized, and affordable assistive technologies, we propose a comprehensive AI-IoT based smart wheelchair system that incorporates glove-based gesture control for hands-free navigation, real-time object detection using YOLOv8 with auditory feedback for obstacle avoidance, and ultrasonic sensing for immediate collision avoidance. Vital signs (heart rate, SpO$_2$, ECG, temperature) are continuously monitored and uploaded to ThingSpeak, with email alerts triggered for critical conditions. Built on a modular, low-cost architecture, the gesture control achieved a 95.5\% success rate, ultrasonic obstacle detection reached 94\% accuracy, and YOLOv8-based object detection delivered 91.5\% Precision, 90.2\% Recall, and a 90.8\% F1-score. This integrated, multi-modal approach offers a practical, scalable, and affordable solution, significantly enhancing user autonomy, safety, and independence by bridging the gap between innovative research and real-world deployment.
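The sketch below illustrates how the ultrasonic range check and YOLOv8-based auditory feedback could be combined in one navigation-safety loop; the sensor, audio, and motor interfaces are hypothetical placeholders, and the model file, camera index, and distance threshold are assumptions rather than the deployed system's values.

```python
# Hedged sketch of a safety loop: ultrasonic stop + spoken YOLOv8 detections.
import cv2
from ultralytics import YOLO

model = YOLO("yolov8n.pt")                 # illustrative model choice
SAFE_DISTANCE_CM = 50                      # illustrative stopping threshold

def read_ultrasonic_cm() -> float:
    raise NotImplementedError("read from the ultrasonic sensor driver here")

def announce(text: str) -> None:
    raise NotImplementedError("send text to the text-to-speech module here")

def stop_motors() -> None:
    raise NotImplementedError("command the motor controller here")

def safety_loop(camera_index: int = 0):
    cap = cv2.VideoCapture(camera_index)
    while cap.isOpened():
        if read_ultrasonic_cm() < SAFE_DISTANCE_CM:
            stop_motors()                  # immediate, sensor-level collision avoidance
            announce("Obstacle ahead, stopping.")
            continue
        ok, frame = cap.read()
        if not ok:
            break
        result = model(frame)[0]           # YOLOv8 detections for the current frame
        names = {result.names[int(c)] for c in result.boxes.cls}
        if names:
            announce("Detected: " + ", ".join(sorted(names)))
```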
Glass surfaces, ubiquitous in both daily life and professional environments, pose a potential threat to vision-based systems such as robot and drone navigation. To address this challenge, recent studies have shown significant interest in Video Glass Surface Detection (VGSD). We observe that objects in the reflection (or transmission) layer appear farther away than the glass surface itself. Consequently, when there is motion in the video, the salient reflected (or transmitted) objects on the glass surface move more slowly than objects in non-glass regions within the same spatial plane, and this motion inconsistency can effectively reveal the presence of glass surfaces. Based on this observation, we propose a novel network, named MVGD-Net, for detecting glass surfaces in videos by leveraging motion inconsistency cues. MVGD-Net features three novel modules: a Cross-scale Multimodal Fusion Module (CMFM) that integrates extracted spatial features with estimated optical flow maps, and a History Guided Attention Module (HGAM) and Temporal Cross Attention Module (TCAM), both of which further enhance temporal features. A Temporal-Spatial Decoder (TSD) is also introduced to fuse the spatial and temporal features and generate the glass region mask. Furthermore, to train our network, we construct a large-scale dataset comprising 312 diverse glass scenarios with a total of 19,268 frames. Extensive experiments demonstrate that MVGD-Net outperforms relevant state-of-the-art methods.
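The snippet below illustrates only the motion-inconsistency cue itself with classical optical flow, not the learned MVGD-Net modules: regions whose flow magnitude is well below the overall scene motion are flagged as glass candidates. The flow parameters and the ratio threshold are assumptions.

```python
# Hedged illustration of the motion-inconsistency cue (not MVGD-Net itself).
import cv2
import numpy as np

def glass_candidate_mask(prev_gray: np.ndarray, curr_gray: np.ndarray,
                         ratio: float = 0.5) -> np.ndarray:
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude = np.linalg.norm(flow, axis=2)       # per-pixel motion strength
    scene_motion = np.median(magnitude) + 1e-6     # rough estimate of overall scene motion
    return magnitude < ratio * scene_motion        # slower-than-scene pixels -> glass candidates
```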
We present SpatialMem, a memory-centric system that unifies 3D geometry, semantics, and language into a single, queryable representation. Starting from casually captured egocentric RGB video, SpatialMem reconstructs metrically scaled indoor environments, detects structural 3D anchors (walls, doors, windows) as the first-layer scaffold, and populates a hierarchical memory with open-vocabulary object nodes -- linking evidence patches, visual embeddings, and two-layer textual descriptions to 3D coordinates -- for compact storage and fast retrieval. This design enables interpretable reasoning over spatial relations (e.g., distance, direction, visibility) and supports downstream tasks such as language-guided navigation and object retrieval without specialized sensors. Experiments across three real-life indoor scenes demonstrate that SpatialMem maintains strong anchor-description-level navigation completion and hierarchical retrieval accuracy under increasing clutter and occlusion, offering an efficient and extensible framework for embodied spatial intelligence.
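To make the memory layout concrete, here is a minimal sketch of the kind of queryable object node and spatial query described above; the field names, types, and the nearest-object helper are illustrative assumptions, not SpatialMem's actual schema.

```python
# Hedged sketch of a hierarchical-memory object node and a simple spatial query.
from dataclasses import dataclass, field
import numpy as np

@dataclass
class ObjectNode:
    label: str                            # open-vocabulary category name
    position: np.ndarray                  # metric 3D coordinates (x, y, z)
    embedding: np.ndarray                 # visual embedding for retrieval
    short_description: str                # coarse textual description (layer 1)
    detailed_description: str             # fine-grained textual description (layer 2)
    evidence_patches: list = field(default_factory=list)  # IDs/paths of evidence crops
    anchor_id: str = ""                   # structural anchor (wall/door/window) it attaches to

def nearest_object(nodes: list[ObjectNode], query_point: np.ndarray) -> ObjectNode:
    # Simple distance-based spatial reasoning: closest object to a 3D query point.
    return min(nodes, key=lambda n: float(np.linalg.norm(n.position - query_point)))
```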
Underwater video analysis is particularly challenging due to factors such as low lighting, color distortion, and turbidity, which compromise visual data quality and directly impact the performance of perception modules in robotic applications. This work proposes AquaFeat+, a plug-and-play pipeline designed to enhance features specifically for automated vision tasks rather than for human perceptual quality. The architecture includes modules for color correction, hierarchical feature enhancement, and an adaptive residual output, which are trained end-to-end and guided directly by the loss function of the final application. Trained and evaluated on the FishTrack23 dataset, AquaFeat+ achieves significant improvements in object detection, classification, and tracking metrics, validating its effectiveness for enhancing perception in underwater robotic applications.
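The sketch below illustrates the plug-and-play idea only: an enhancement module with an adaptive residual output placed in front of a downstream model and optimized with that model's own task loss, so enhancement is driven by task performance rather than perceptual quality. The module architecture, residual scaling, and detector interface are assumptions, not AquaFeat+'s actual design.

```python
# Hedged sketch of task-driven, plug-and-play image enhancement.
import torch
import torch.nn as nn

class EnhancerWithResidual(nn.Module):
    def __init__(self, channels: int = 32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 3, 3, padding=1),
        )
        self.alpha = nn.Parameter(torch.zeros(1))   # adaptive residual weight

    def forward(self, x):
        return x + self.alpha * self.body(x)        # enhanced image = input + scaled residual

def train_step(enhancer, detector_loss_fn, images, targets, optimizer):
    # detector_loss_fn(images, targets) returns the downstream task loss;
    # gradients flow back through the enhancer, so enhancement is task-guided.
    optimizer.zero_grad()
    loss = detector_loss_fn(enhancer(images), targets)
    loss.backward()
    optimizer.step()
    return loss.item()
```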
In disaster scenarios, ensuring both reliable communication and situational awareness becomes a critical challenge due to the partial or complete collapse of terrestrial networks. This paper proposes an integrated sensing and communication (ISAC) over non-terrestrial networks (NTN) architecture, referred to as ISAC-over-NTN, that combines multiple uncrewed aerial vehicles (UAVs) and a high-altitude platform station (HAPS) to maintain resilient and reliable network operations in post-disaster conditions. We aim to achieve two main objectives: i) provide a reliable communication infrastructure, thereby ensuring the continuity of search-and-rescue activities and connecting people to their loved ones, and ii) detect users, such as those trapped under rubble or those who are mobile, using a Doppler-based mobility detection model. We employ an innovative beamforming method that simultaneously transmits data and detects Doppler-based mobility by integrating multi-user multiple-input multiple-output (MU-MIMO) communication and monostatic sensing within the same transmission chain. The results show that the proposed framework maintains reliable connectivity and achieves high user detection performance in critical locations, reaching 90% motion detection sensitivity and 88% detection accuracy.
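The sketch below illustrates the Doppler-based mobility cue in its simplest form: the standard monostatic (round-trip) relation f_d = 2 v f_c / c, plus an FFT-peak detector over slow-time echo samples that flags a user as mobile when the dominant Doppler shift exceeds a threshold. The sampling parameters, threshold, and FFT detector are illustrative assumptions, not the paper's beamforming-integrated method.

```python
# Hedged sketch of Doppler-based mobility detection for monostatic sensing.
import numpy as np

C = 3e8                                    # speed of light (m/s)

def doppler_from_speed(speed_mps: float, carrier_hz: float) -> float:
    return 2 * speed_mps * carrier_hz / C  # monostatic (round-trip) Doppler shift

def is_mobile(slow_time_samples: np.ndarray, prf_hz: float,
              threshold_hz: float = 5.0) -> bool:
    # slow_time_samples: complex echoes from one range bin, one sample per pulse/slot.
    spectrum = np.fft.fftshift(np.fft.fft(slow_time_samples))
    freqs = np.fft.fftshift(np.fft.fftfreq(len(slow_time_samples), d=1.0 / prf_hz))
    dominant = freqs[np.argmax(np.abs(spectrum))]
    return abs(dominant) > threshold_hz

# Example: a pedestrian moving at 1 m/s under a 2 GHz carrier yields ~13.3 Hz of Doppler.
print(round(doppler_from_speed(1.0, 2e9), 1))
```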