Object detection is a computer vision task in which the goal is to detect and locate objects of interest in an image or video. The task involves identifying the position and boundaries of objects in an image, and classifying the objects into different categories. It forms a crucial part of vision recognition, alongside image classification and retrieval.
As the population ages rapidly, long-term care (LTC) facilities across North America face growing pressure to monitor residents safely while keeping staff workload manageable. Falls are among the most critical events to monitor due to their timely response requirement, yet frequent false alarms or uncertain detections can overwhelm caregivers and contribute to alarm fatigue. This motivates the design of reliable, whole end-to-end ambient monitoring systems from occupancy and activity awareness to fall and post-fall detection. In this paper, we focus on robust post-fall floor-occupancy detection using an off-the-shelf 60 GHz FMCW radar and evaluate its deployment in a realistic, furnished indoor environment representative of LTC facilities. Post-fall detection is challenging since motion is minimal, and reflections from the floor and surrounding objects can dominate the radar signal return. We compare a vendor-provided digital beamforming (DBF) pipeline against a proposed preprocessing approach based on Capon or minimum variance distortionless response (MVDR) beamforming. A cell-averaging constant false alarm rate (CA-CFAR) detector is applied and evaluated on the resulting range-azimuth maps across 7 participants. The proposed method improves the mean frame-positive rate from 0.823 (DBF) to 0.916 (Proposed).
The scientific peer-review process is facing a shortage of human resources due to the rapid growth in the number of submitted papers. The use of language models to reduce the human cost of peer review has been actively explored as a potential solution to this challenge. A method has been proposed to evaluate the level of substantiation in scientific reviews in a manner that is interpretable by humans. This method extracts the core components of an argument, claims and evidence, and assesses the level of substantiation based on the proportion of claims supported by evidence. The level of substantiation refers to the extent to which claims are based on objective facts. However, when assessing the level of substantiation, simply detecting the presence or absence of supporting evidence for a claim is insufficient; it is also necessary to accurately assess the logical inference between a claim and its evidence. We propose a new evaluation metric for scientific review comments that assesses the logical inference between claims and evidence. Experimental results show that the proposed method achieves a higher correlation with human scores than conventional methods, indicating its potential to better support the efficiency of the peer-review process.
Autonomous robotic systems are widely deployed in smart factories and operate in dynamic, uncertain, and human-involved environments that require low-latency and robust fault detection and recovery (FDR). However, existing FDR frameworks exhibit various limitations, such as significant delays in communication and computation, and unreliability in robot motion/trajectory generation, mainly because the communication-computation-control (3C) loop is designed without considering the downstream FDR goal. To address this, we propose a novel Goal-oriented Communication (GoC) framework that jointly designs the 3C loop tailored for fast and robust robotic FDR, with the goal of minimising the FDR time while maximising the robotic task (e.g., workpiece sorting) success rate. For fault detection, our GoC framework innovatively defines and extracts the 3D scene graph (3D-SG) as the semantic representation via our designed representation extractor, and detects faults by monitoring spatial relationship changes in the 3D-SG. For fault recovery, we fine-tune a small language model (SLM) via Low-Rank Adaptation (LoRA) and enhance its reasoning and generalization capabilities via knowledge distillation to generate recovery motions for robots. We also design a lightweight goal-oriented digital twin reconstruction module to refine the recovery motions generated by the SLM when fine-grained robotic control is required, using only task-relevant object contours for digital twin reconstruction. Extensive simulations demonstrate that our GoC framework reduces the FDR time by up to 82.6% and improves the task success rate by up to 76%, compared to the state-of-the-art frameworks that rely on vision language models for fault detection and large language models for fault recovery.
Photoacoustic computed tomography (PACT) is a promising imaging modality that combines the advantages of optical contrast with ultrasound detection. Utilizing ultrasound transducers with larger surface areas can improve detection sensitivity. However, when computationally efficient analytic reconstruction methods that neglect the spatial impulse responses (SIRs) of the transducer are employed, the spatial resolution of the reconstructed images will be compromised. Although optimization-based reconstruction methods can explicitly account for SIR effects, their computational cost is generally high, particularly in three-dimensional (3D) applications. To address the need for accurate but rapid 3D PACT image reconstruction, this study presents a framework for establishing a learned SIR compensation method that operates in the data domain. The learned compensation method maps SIR-corrupted PACT measurement data to compensated data that would have been recorded by idealized point-like transducers. Subsequently, the compensated data can be used with a computationally efficient reconstruction method that neglects SIR effects. Two variants of the learned compensation model are investigated that employ a U-Net model and a specifically designed, physics-inspired model, referred to as Deconv-Net. A fast and analytical training data generation procedure is also a component of the presented framework. The framework is rigorously validated in virtual imaging studies, demonstrating resolution improvement and robustness to noise variations, object complexity, and sound speed heterogeneity. When applied to in-vivo breast imaging data, the learned compensation models revealed fine structures that had been obscured by SIR-induced artifacts. To our knowledge, this is the first demonstration of learned SIR compensation in 3D PACT imaging.
Intelligent surveillance systems often handle perceptual tasks such as object detection, facial recognition, and emotion analysis independently, but they lack a unified, adaptive runtime scheduler that dynamically allocates computational resources based on contextual triggers. This limits their holistic understanding and efficiency on low-power edge devices. To address this, we present a real-time multi-modal vision framework that integrates object detection, owner-specific face recognition, and emotion detection into a unified pipeline deployed on a Raspberry Pi 5 edge platform. The core of our system is an adaptive scheduling mechanism that reduces computational load by 65\% compared to continuous processing by selectively activating modules such as, YOLOv8n for object detection, a custom FaceNet-based embedding system for facial recognition, and DeepFace's CNN for emotion classification. Experimental results demonstrate the system's efficacy, with the object detection module achieving an Average Precision (AP) of 0.861, facial recognition attaining 88\% accuracy, and emotion detection showing strong discriminatory power (AUC up to 0.97 for specific emotions), while operating at 5.6 frames per second. Our work demonstrates that context-aware scheduling is the key to unlocking complex multi-modal AI on cost-effective edge hardware, making intelligent perception more accessible and privacy-preserving.
A key challenge in employing data, algorithms and data-driven systems is to adhere to the principle of fairness and justice. Statistical fairness measures belong to an important category of technical/formal mechanisms for detecting fairness issues in data and algorithms. In this contribution we study the relations between two types of statistical fairness measures namely Statistical-Parity and Equalized-Odds. The Statistical-Parity measure does not rely on having ground truth, i.e., (objectively) labeled target attributes. This makes Statistical-Parity a suitable measure in practice for assessing fairness in data and data classification algorithms. Therefore, Statistical-Parity is adopted in many legal and professional frameworks for assessing algorithmic fairness. The Equalized-Odds measure, on the contrary, relies on having (reliable) ground-truth, which is not always feasible in practice. Nevertheless, there are several situations where the Equalized-Odds definition should be satisfied to enforce false prediction parity among sensitive social groups. We present a novel analyze of the relation between Statistical-Parity and Equalized-Odds, depending on the base-rates of sensitive groups. The analysis intuitively shows how and when base-rate imbalance causes incompatibility between Statistical-Parity and Equalized-Odds measures. As such, our approach provides insight in (how to make design) trade-offs between these measures in practice. Further, based on our results, we plea for examining base-rate (im)balance and investigating the possibility of such an incompatibility before enforcing or relying on the Statistical-Parity criterion. The insights provided, we foresee, may trigger initiatives to improve or adjust the current practice and/or the existing legal frameworks.
The sparse object detection paradigm shift towards dense 3D semantic occupancy prediction is necessary for dealing with long-tail safety challenges for autonomous vehicles. Nonetheless, the current voxelization methods commonly suffer from excessive computation complexity demands, where the fusion process is brittle, static, and breaks down under dynamic environmental settings. To this end, this research work enhances a novel Gaussian-based adaptive camera-LiDAR multimodal 3D occupancy prediction model that seamlessly bridges the semantic strengths of camera modality with the geometric strengths of LiDAR modality through a memory-efficient 3D Gaussian model. The proposed solution has four key components: (1) LiDAR Depth Feature Aggregation (LDFA), where depth-wise deformable sampling is employed for dealing with geometric sparsity, (2) Entropy-Based Feature Smoothing, where cross-entropy is employed for handling domain-specific noise, (3) Adaptive Camera-LiDAR Fusion, where dynamic recalibration of sensor outputs is performed based on model outputs, and (4) Gauss-Mamba Head that uses Selective State Space Models for global context decoding that enjoys linear computation complexity.
In remote sensing images, complex backgrounds, weak object signals, and small object scales make accurate detection particularly challenging, especially under low-quality imaging conditions. A common strategy is to integrate single-image super-resolution (SR) before detection; however, such serial pipelines often suffer from misaligned optimization objectives, feature redundancy, and a lack of effective interaction between SR and detection. To address these issues, we propose a Saliency-Driven multi-task Collaborative Network (SDCoNet) that couples SR and detection through implicit feature sharing while preserving task specificity. SDCoNet employs the swin transformer-based shared encoder, where hierarchical window-shifted self-attention supports cross-task feature collaboration and adaptively balances the trade-off between texture refinement and semantic representation. In addition, a multi-scale saliency prediction module produces importance scores to select key tokens, enabling focused attention on weak object regions, suppression of background clutter, and suppression of adverse features introduced by multi-task coupling. Furthermore, a gradient routing strategy is introduced to mitigate optimization conflicts. It first stabilizes detection semantics and subsequently routes SR gradients along a detection-oriented direction, enabling the framework to guide the SR branch to generate high-frequency details that are explicitly beneficial for detection. Experiments on public datasets, including NWPU VHR-10-Split, DOTAv1.5-Split, and HRSSD-Split, demonstrate that the proposed method, while maintaining competitive computational efficiency, significantly outperforms existing mainstream algorithms in small object detection on low-quality remote sensing images. Our code is available at https://github.com/qiruo-ya/SDCoNet.
The adoption of Large Language Models (LLMs) in scientific writing promises efficiency but risks introducing informational entropy. While "hallucinated papers" are a known artifact, the systematic degradation of valid citation chains remains unquantified. We conducted a forensic audit of 50 recent survey papers in Artificial Intelligence (N=5,514 citations) published between September 2024 and January 2026. We utilized a hybrid verification pipeline combining DOI resolution, Crossref metadata analysis, Semantic Scholar queries, and fuzzy text matching to distinguish between formatting errors ("Sloppiness") and verifiable non-existence ("Phantoms). We detect a persistent 17.0% Phantom Rate -- citations that cannot be resolved to any digital object despite aggressive forensic recovery. Diagnostic categorization reveals three distinct failure modes: pure hallucinations (5.1%), hallucinated identifiers with valid titles (16.4%), and parsing-induced matching failures (78.5%). Longitudinal analysis reveals a flat trend (+0.07 pp/month), suggesting that high-entropy citation practices have stabilized as an endemic feature of the field. The scientific citation graph in AI survey literature exhibits "link rot" at scale. This suggests a mechanism where AI tools act as "lazy research assistants," retrieving correct titles but hallucinating metadata, thereby severing the digital chain of custody required for reproducible science.
Reliable drone detection is challenging due to limited annotated real-world data, large appearance variability, and the presence of visually similar distractors such as birds. To address these challenges, this paper introduces SimD3, a large-scale high-fidelity synthetic dataset designed for robust drone detection in complex aerial environments. Unlike existing synthetic drone datasets, SimD3 explicitly models drones with heterogeneous payloads, incorporates multiple bird species as realistic distractors, and leverages diverse Unreal Engine 5 environments with controlled weather, lighting, and flight trajectories captured using a 360 six-camera rig. Using SimD3, we conduct an extensive experimental evaluation within the YOLOv5 detection framework, including an attention-enhanced variant termed Yolov5m+C3b, where standard bottleneck-based C3 blocks are replaced with C3b modules. Models are evaluated on synthetic data, combined synthetic and real data, and multiple unseen real-world benchmarks to assess robustness and generalization. Experimental results show that SimD3 provides effective supervision for small-object drone detection and that Yolov5m+C3b consistently outperforms the baseline across in-domain and cross-dataset evaluations. These findings highlight the utility of SimD3 for training and benchmarking robust drone detection models under diverse and challenging conditions.