What is Head Detection? Head detection is the process of identifying and locating human heads in images or videos.
Papers and Code
May 09, 2025
Abstract: Community detection in networks with overlapping structures remains a significant challenge, particularly in noisy real-world environments where integrating topology, node attributes, and prior information is critical. To address this, we propose a semi-supervised graph autoencoder that combines graph multi-head attention and modularity maximization to robustly detect overlapping communities. The model learns semantic representations by fusing structural, attribute, and prior knowledge while explicitly addressing noise in node features. Key innovations include a noise-resistant architecture and a semantic semi-supervised design optimized for community quality through modularity constraints. Experiments demonstrate superior performance: the model outperforms state-of-the-art methods in overlapping community detection (improvements in NMI and F1-score) and exhibits exceptional robustness to attribute noise, maintaining stable performance under 60% feature corruption. These results highlight the importance of integrating attribute semantics and structural patterns for accurate community discovery in complex networks.
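The abstract does not spell out the modularity objective, but a common way to impose such a constraint on soft (overlapping) assignments is a differentiable modularity term added to the autoencoder loss. The sketch below is one standard formulation in PyTorch, not the paper's exact loss; `soft_modularity_loss` and the toy graph are illustrative.

```python
import torch

def soft_modularity_loss(adj: torch.Tensor, assign: torch.Tensor) -> torch.Tensor:
    """Negative soft modularity. `assign` holds soft (possibly overlapping)
    community memberships, shape (n_nodes, n_communities)."""
    deg = adj.sum(dim=1)                      # (weighted) node degrees
    two_m = deg.sum()                         # 2 * total edge weight
    B = adj - torch.outer(deg, deg) / two_m   # modularity matrix
    q = torch.trace(assign.T @ B @ assign) / two_m
    return -q                                 # minimize -Q to maximize Q

# Toy check: 4-node graph, 2 soft communities
adj = torch.tensor([[0., 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]])
memberships = torch.softmax(torch.randn(4, 2), dim=1)
print(soft_modularity_loss(adj, memberships))
```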

May 09, 2025
Abstract: Reliance on spurious correlations (shortcuts) has been shown to underlie many of the successes of language models. Previous work focused on identifying the input elements that impact prediction. We investigate how shortcuts are actually processed within the model's decision-making mechanism. We use actor names in movie reviews as controllable shortcuts with known impact on the outcome. We use mechanistic interpretability methods and identify specific attention heads that focus on shortcuts. These heads gear the model towards a label before processing the complete input, effectively making premature decisions that bypass contextual analysis. Based on these findings, we introduce Head-based Token Attribution (HTA), which traces intermediate decisions back to input tokens. We show that HTA is effective in detecting shortcuts in LLMs and enables targeted mitigation by selectively deactivating shortcut-related attention heads.
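As a hedged illustration of the mitigation step: HuggingFace BERT/GPT-2-style models accept a `head_mask` argument that scales each attention head's contribution, so a flagged head can be zeroed at inference. The checkpoint name and the (layer, head) index below are placeholders; this is generic head ablation, not the paper's HTA procedure.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "textattack/bert-base-uncased-SST-2"   # placeholder sentiment model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name).eval()

# 1.0 keeps a head, 0.0 deactivates it; shape (num_layers, num_heads)
mask = torch.ones(model.config.num_hidden_layers,
                  model.config.num_attention_heads)
mask[8, 3] = 0.0   # hypothetical shortcut-related head

inputs = tok("A dull plot, but DiCaprio is in it.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs, head_mask=mask).logits
print(logits.softmax(-1))
```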

May 08, 2025
Abstract: We present a systematic study of provider-side data poisoning in retrieval-augmented generation (RAG)-based recommender systems. By modifying only a small fraction of tokens within item descriptions -- for instance, adding emotional keywords or borrowing phrases from semantically related items -- an attacker can significantly promote or demote targeted items. We formalize these attacks under token-edit and semantic-similarity constraints, and we examine their effectiveness in both promotion (long-tail items) and demotion (short-head items) scenarios. Our experiments on MovieLens, using two large language model (LLM) retrieval modules, show that even subtle attacks shift final rankings and item exposure while eluding naive detection. The results underscore the vulnerability of RAG-based pipelines to small-scale metadata rewrites and emphasize the need for robust textual-consistency checks and provenance tracking to thwart stealthy provider-side poisoning.
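The token-edit constraint can be made concrete with a simple budget check. The sketch below uses `difflib` to count token-level edits and is only a simplified stand-in for the paper's formalization; the semantic-similarity constraint would additionally require an embedding model.

```python
import difflib

def edit_fraction(orig_tokens: list[str], new_tokens: list[str]) -> float:
    """Fraction of the original tokens touched by insert/replace/delete ops."""
    sm = difflib.SequenceMatcher(a=orig_tokens, b=new_tokens)
    edited = sum(max(i2 - i1, j2 - j1)
                 for op, i1, i2, j1, j2 in sm.get_opcodes() if op != "equal")
    return edited / max(len(orig_tokens), 1)

orig = "A quiet drama about a small-town chess club".split()
poisoned = "A breathtaking, unmissable drama about a small-town chess club".split()
print(edit_fraction(orig, poisoned) <= 0.3)   # True -> within the edit budget
```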

May 07, 2025
Abstract: This study addresses the inherent limitations of Multi-Layer Perceptrons (MLPs) in Vision Transformers (ViTs) by introducing the Hybrid Kolmogorov-Arnold Network (KAN)-ViT (Hyb-KAN ViT), a novel framework that integrates wavelet-based spectral decomposition and spline-optimized activation functions. Prior work has not exploited the prebuilt modularity of the ViT architecture or the edge-detection capabilities of wavelet functions. We propose two key modules: Efficient-KAN (Eff-KAN), which replaces MLP layers with spline functions, and Wavelet-KAN (Wav-KAN), which leverages orthogonal wavelet transforms for multi-resolution feature extraction. These modules are systematically integrated into ViT encoder layers and classification heads to enhance spatial-frequency modeling while mitigating computational bottlenecks. Experiments on ImageNet-1K (image recognition), COCO (object detection and instance segmentation), and ADE20K (semantic segmentation) demonstrate state-of-the-art performance with Hyb-KAN ViT. Ablation studies validate the efficacy of wavelet-driven spectral priors in segmentation and of spline-based efficiency in detection tasks. The framework establishes a new paradigm for balancing parameter efficiency and multi-scale representation in vision architectures.
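Eff-KAN's core move, replacing fixed MLP activations with learnable spline functions, can be sketched with a toy per-channel learnable activation. Real KAN layers use B-spline bases on a grid; the Gaussian-bump basis and class below are simplifications for illustration only.

```python
import torch
import torch.nn as nn

class ToySplineActivation(nn.Module):
    """Per-channel learnable activation: a fixed grid of Gaussian bumps
    with learned mixing weights, standing in for the B-spline
    parameterization used in KAN layers (simplified sketch)."""
    def __init__(self, channels: int, grid_size: int = 8, span: float = 3.0):
        super().__init__()
        self.register_buffer("grid", torch.linspace(-span, span, grid_size))
        self.coef = nn.Parameter(torch.zeros(channels, grid_size))
        self.base = nn.SiLU()          # residual base activation, as in KANs

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., channels); evaluate each bump at each input value
        bumps = torch.exp(-(x.unsqueeze(-1) - self.grid) ** 2)  # (..., C, G)
        return self.base(x) + (bumps * self.coef).sum(dim=-1)

act = ToySplineActivation(channels=64)
y = act(torch.randn(8, 64))            # drop-in for an elementwise MLP act
```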

May 07, 2025
Abstract: Current RGB-D methods usually leverage large-scale backbones to improve accuracy but sacrifice efficiency. Meanwhile, several existing lightweight methods struggle to achieve high-precision performance. To balance efficiency and performance, we propose a Speed-Accuracy Tradeoff Network (SATNet) for lightweight RGB-D SOD, built from three fundamental perspectives: depth quality, modality fusion, and feature representation. Concerning depth quality, we introduce the Depth Anything Model to generate high-quality depth maps, which effectively alleviates the multi-modal gaps in current datasets. For modality fusion, we propose a Decoupled Attention Module (DAM) to explore the consistency within and between modalities. Here, the multi-modal features are decoupled into dual-view feature vectors to project the discriminable information of the feature maps. For feature representation, we develop a Dual Information Representation Module (DIRM) with a bi-directional inverted framework to enlarge the limited feature space generated by the lightweight backbones. DIRM models texture features and saliency features to enrich the feature space, and employs two-way prediction heads to optimize its parameters through bi-directional backpropagation. Finally, we design a Dual Feature Aggregation Module (DFAM) in the decoder to aggregate texture and saliency features. Extensive experiments on five public RGB-D SOD datasets indicate that the proposed SATNet outperforms state-of-the-art (SOTA) CNN-based heavyweight models while remaining lightweight, with 5.2 M parameters and 415 FPS.
* Accepted by TIP 2025
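The abstract gives DAM's intent (dual-view decoupling of multi-modal features) rather than its architecture, so the following is only a generic channel-view/spatial-view cross-gating sketch in the same spirit, with all module shapes chosen arbitrarily; it is not the paper's DAM.

```python
import torch
import torch.nn as nn

class DecoupledFusionSketch(nn.Module):
    """Each modality is summarized into a channel-view vector and a
    spatial-view map, and the two modalities gate each other before
    summation (illustrative only)."""
    def __init__(self, c: int):
        super().__init__()
        self.ch = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                nn.Conv2d(c, c, 1), nn.Sigmoid())
        self.sp = nn.Sequential(nn.Conv2d(c, 1, 7, padding=3), nn.Sigmoid())

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        rgb_g = rgb * self.ch(depth)      # channel view: depth gates RGB
        dep_g = depth * self.sp(rgb)      # spatial view: RGB gates depth
        return rgb_g + dep_g

fuse = DecoupledFusionSketch(32)
out = fuse(torch.randn(2, 32, 64, 64), torch.randn(2, 32, 64, 64))
```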

May 01, 2025
Abstract: In the field of X-ray security applications, even the smallest details can significantly impact outcomes. Objects that are heavily occluded or intentionally concealed pose a great challenge for detection, whether by human observation or through advanced technological applications. While certain Deep Learning (DL) architectures demonstrate strong performance in processing local information, such as Convolutional Neural Networks (CNNs), others excel in handling distant information, e.g., transformers. In X-ray security imaging, the literature has been dominated by CNN-based methods, while the integration of the two aforementioned leading architectures has not been sufficiently explored. In this paper, various hybrid CNN-transformer architectures are evaluated against a common CNN object detection baseline, namely YOLOv8. In particular, a CNN (HGNetV2) and a hybrid CNN-transformer (Next-ViT-S) backbone are combined with different CNN/transformer detection heads (YOLOv8 and RT-DETR). The resulting architectures are comparatively evaluated on three challenging public X-ray inspection datasets, namely EDS, HiXray, and PIDray. Interestingly, while the YOLOv8 detector with its default backbone (CSP-DarkNet53) is generally shown to be advantageous on the HiXray and PIDray datasets, when a domain distribution shift is incorporated in the X-ray images (as happens in the EDS dataset), hybrid CNN-transformer architectures exhibit increased robustness. Detailed comparative evaluation results, including object-level detection performance and object-size error analysis, demonstrate the strengths and weaknesses of each architectural combination and suggest guidelines for future research. The source code and network weights of the models employed in this study are available at https://github.com/jgenc/xray-comparative-evaluation.
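Reproducing the paper's exact backbone/head pairings requires custom model configs, but the comparative-evaluation loop itself is straightforward with the ultralytics API. The dataset YAML names below are placeholders standing in for the three benchmarks, and the stock checkpoints stand in for the hybrid variants.

```python
from ultralytics import YOLO, RTDETR

# Detector variants under comparison; the paper's hybrid backbone/head
# combinations would be defined via custom model configuration files.
models = {
    "yolov8n": YOLO("yolov8n.pt"),
    "rtdetr-l": RTDETR("rtdetr-l.pt"),
}
datasets = ["EDS.yaml", "HiXray.yaml", "PIDray.yaml"]   # hypothetical YAMLs

for name, model in models.items():
    for data in datasets:
        metrics = model.val(data=data)   # COCO-style detection metrics
        print(name, data, metrics.box.map50)
```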

May 03, 2025
Abstract: Biosignal monitoring, in particular heart activity through heart rate (HR) and heart rate variability (HRV) tracking, is vital in enabling continuous, non-invasive tracking of physiological and cognitive states. Recent studies have explored compact, head-worn devices for HR and HRV monitoring to improve usability and reduce stigma. However, this approach is challenged by the current reliance on wet electrodes, which limits usability; by the weakness of ear-derived signals, which makes HR/HRV extraction more complex; and by the incompatibility of current algorithms with embedded deployment. This work introduces a single-ear wearable system for real-time ECG (electrocardiogram) parameter estimation, which runs directly on BioGAP, an energy-efficient device for biosignal acquisition and processing. By combining state-of-the-art (SoA) in-ear electrode technology, an optimized DeepMF algorithm, and BioGAP, our subject-independent approach allows robust extraction of HR/HRV parameters directly on the device at just 36.7 µJ per inference, with performance comparable to the current state-of-the-art architecture: 0.49 bpm and 25.82 ms mean errors for HR and HRV, respectively, and an estimated battery life of 36 h at a total system power consumption of 7.6 mW. Clinical relevance: The ability to reconstruct ECG signals and extract HR and HRV paves the way for continuous, unobtrusive cardiovascular monitoring with head-worn devices. In particular, the integration of cardiovascular measurements into everyday-use devices (such as earbuds) has potential for continuous at-home monitoring to enable early detection of cardiovascular irregularities.
* 7 pages, 7 figures, to be published in IEEE EMBC conference
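Downstream of the model, turning detected R-peaks into HR and an HRV statistic is standard signal processing. The sketch below computes HR and RMSSD (one common HRV measure; the abstract does not state which HRV metric its 25.82 ms error refers to) from peak timestamps.

```python
import numpy as np

def hr_hrv_from_peaks(r_peak_times_s: np.ndarray) -> tuple[float, float]:
    """HR (bpm) and RMSSD (ms) from R-peak timestamps in seconds."""
    rr = np.diff(r_peak_times_s)                     # RR intervals (s)
    hr_bpm = 60.0 / rr.mean()
    rmssd_ms = np.sqrt(np.mean(np.diff(rr) ** 2)) * 1000.0
    return hr_bpm, rmssd_ms

rng = np.random.default_rng(0)
peaks = np.cumsum(0.8 + rng.normal(0.0, 0.02, 64))   # ~75 bpm with jitter
print(hr_hrv_from_peaks(peaks))
```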

May 02, 2025
Abstract: Unmanned Aerial Vehicles (UAVs) rely on measurements from Inertial Measurement Units (IMUs) to maintain stable flight. However, IMUs are susceptible to physical attacks, including acoustic resonance and electromagnetic interference attacks, resulting in immediate UAV crashes. Consequently, we introduce a Model-based Anomaly detection and Recovery System (MARS) that enables UAVs to quickly detect adversarial attacks on inertial sensors and achieve dynamic flight recovery. MARS features an attack-resilient state estimator based on the Extended Kalman Filter, which incorporates position, velocity, heading, and rotor speed measurements to reconstruct accurate attitude and angular velocity information for UAV control. Moreover, a statistical anomaly detection system monitors IMU sensor data, raising a system-level alert if an attack is detected. Upon receiving the alert, a multi-stage dynamic flight recovery strategy suspends the ongoing mission, stabilizes the drone in a hovering condition, and then resumes tasks under resilient control. Experimental results in PX4 software-in-the-loop environments, as well as on real-world MARS-PX4 autopilot-equipped drones, demonstrate the superiority of our approach over existing IMU-defense frameworks, showcasing the ability of the UAVs to survive attacks and complete their missions.
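MARS's detector is described only as a statistical anomaly monitor on IMU data; a textbook version of such a check is the chi-square test on the Kalman innovation, sketched below with hypothetical matrices (3-DoF measurement, 95% bound 7.815). This is a generic residual detector, not MARS itself.

```python
import numpy as np

def innovation_anomaly(z, z_pred, H, P, R, threshold=7.815):
    """Flag a measurement whose normalized innovation exceeds the
    chi-square bound (7.815 = 95% quantile for 3 degrees of freedom)."""
    nu = z - z_pred                     # innovation (residual)
    S = H @ P @ H.T + R                 # innovation covariance
    d2 = float(nu @ np.linalg.solve(S, nu))
    return d2 > threshold

# Hypothetical 3-axis gyro measurement checked against an EKF prediction
H = np.eye(3); P = 0.01 * np.eye(3); R = 0.02 * np.eye(3)
print(innovation_anomaly(np.array([0.9, 0.0, 0.0]),
                         np.array([0.1, 0.0, 0.0]), H, P, R))  # True
```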

May 02, 2025
Abstract: The success of large pre-trained object detectors hinges on their adaptability to diverse downstream tasks. While fine-tuning is the standard adaptation method, specializing these models for challenging fine-grained domains necessitates careful consideration of feature granularity. The critical question remains: how deeply should the pre-trained backbone be fine-tuned to optimize for the specialized task without incurring catastrophic forgetting of the original general capabilities? Addressing this, we present a systematic empirical study evaluating the impact of fine-tuning depth. We adapt a standard YOLOv8n model to a custom, fine-grained fruit detection dataset by progressively unfreezing backbone layers (freeze points at layers 22, 15, and 10) and training. Performance was rigorously evaluated on both the target fruit dataset and, using a dual-head evaluation architecture, on the original COCO validation set. Our results demonstrate unequivocally that deeper fine-tuning (unfreezing down to layer 10) yields substantial performance gains (e.g., +10% absolute mAP50) on the fine-grained fruit task compared to only training the head. Strikingly, this significant adaptation and specialization resulted in negligible performance degradation (<0.1% absolute mAP difference) on the COCO benchmark across all tested freeze levels. We conclude that adapting mid-to-late backbone features is highly effective for fine-grained specialization. Critically, our results demonstrate this adaptation can be achieved without the commonly expected penalty of catastrophic forgetting, presenting a compelling case for exploring deeper fine-tuning strategies, particularly when targeting complex domains or when maximizing specialized performance is paramount.
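The freeze-point sweep maps directly onto the ultralytics `freeze` training argument, which keeps the first N modules of the model frozen. The sketch below mirrors the study's three freeze points; the fruit dataset YAML and epoch count are placeholders, not the paper's training recipe.

```python
from ultralytics import YOLO

# Progressively deeper fine-tuning: freeze fewer layers each run.
for freeze_at in (22, 15, 10):
    model = YOLO("yolov8n.pt")                 # fresh pre-trained weights
    model.train(data="fruit.yaml",             # hypothetical dataset YAML
                epochs=50, freeze=freeze_at)
    model.val(data="fruit.yaml")               # evaluate on the target task
```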

Apr 28, 2025
Abstract: The increasing demand for aquaculture production necessitates the development of innovative, intelligent tools to effectively monitor and manage fish health and welfare. While non-invasive video monitoring has become a common practice in finfish aquaculture, existing intelligent monitoring methods predominantly focus on assessing body condition or fish swimming patterns and are often developed and evaluated in controlled tank environments, without demonstrating their applicability to real-world aquaculture settings in open sea farms. This underscores the need for methods that can monitor physiological traits directly within the production environment of sea fish farms. To this end, we have developed a computer vision method for monitoring ventilation rates of Atlantic salmon (Salmo salar), specifically designed for videos recorded in the production environment of commercial sea fish farms using the existing infrastructure. Our approach uses a fish head detection model, which classifies the mouth state as either open or closed using a convolutional neural network. This is followed by multiple-object tracking to create temporal sequences of fish swimming across the field of view of the underwater video camera, from which ventilation rates are estimated. The method achieved a Pearson correlation coefficient of 0.82 between ground-truth and predicted ventilation rates on a test set of 100 fish collected independently of the training data. By accurately identifying pens where fish exhibit signs of respiratory distress, our method offers broad applicability and the potential to transform fish health and welfare monitoring in finfish aquaculture.
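The final estimation step reduces to counting mouth openings over a tracked sequence. The sketch below is a simplified version of that post-tracking computation under assumed inputs (a per-frame binary mouth state and a known frame rate), not the paper's exact pipeline.

```python
import numpy as np

def ventilation_rate(mouth_open: np.ndarray, fps: float) -> float:
    """Openings per minute for one tracked fish: count closed->open
    transitions in the per-frame mouth-state sequence (1 = open)."""
    states = mouth_open.astype(int)
    openings = np.sum((states[1:] == 1) & (states[:-1] == 0))
    duration_min = len(states) / fps / 60.0
    return openings / duration_min

track = np.array([0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0])
print(ventilation_rate(track, fps=2.0))   # 3 openings in 8 s -> 22.5/min
```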
