Abstract:Polyp detectors trained on clean datasets often underperform in real-world endoscopy, where illumination changes, motion blur, and occlusions degrade image quality. Existing approaches struggle with the domain gap between controlled laboratory conditions and clinical practice, where adverse imaging conditions are prevalent. In this work, we propose AdaptiveDetector, a novel two-stage detector-verifier framework comprising a YOLOv11 detector with a vision-language model (VLM) verifier. The detector adaptively adjusts per-frame confidence thresholds under VLM guidance, while the verifier is fine-tuned with Group Relative Policy Optimization (GRPO) using an asymmetric, cost-sensitive reward function specifically designed to discourage missed detections -- a critical clinical requirement. To enable realistic assessment under challenging conditions, we construct a comprehensive synthetic testbed by systematically degrading clean datasets with adverse conditions commonly encountered in clinical practice, providing a rigorous benchmark for zero-shot evaluation. Extensive zero-shot evaluation on synthetically degraded CVC-ClinicDB and Kvasir-SEG images demonstrates that our approach improves recall by 14 to 22 percentage points over YOLO alone, while precision remains within 0.7 points below to 1.7 points above the baseline. This combination of adaptive thresholding and cost-sensitive reinforcement learning achieves clinically aligned, open-world polyp detection with substantially fewer false negatives, thereby reducing the risk of missed precancerous polyps and improving patient outcomes.
Abstract:Violence detection in public surveillance is critical for public safety. This study addresses challenges such as small-scale targets, complex environments, and real-time temporal analysis. We propose Vi-SAFE, a spatial-temporal framework that integrates an enhanced YOLOv8 with a Temporal Segment Network (TSN) for video surveillance. The YOLOv8 model is optimized with GhostNetV3 as a lightweight backbone, an exponential moving average (EMA) attention mechanism, and pruning to reduce computational cost while maintaining accuracy. YOLOv8 and TSN are trained separately on pedestrian and violence datasets, where YOLOv8 extracts human regions and TSN performs binary classification of violent behavior. Experiments on the RWF-2000 dataset show that Vi-SAFE achieves an accuracy of 0.88, surpassing TSN alone (0.77) and outperforming existing methods in both accuracy and efficiency, demonstrating its effectiveness for public safety surveillance. Code is available at https://anonymous.4open.science/r/Vi-SAFE-3B42/README.md.