Abstract:With the rapid advancement of Unmanned Aerial Vehicle (UAV) and computer vision technologies, object detection from UAV perspectives has emerged as a prominent research area. However, challenges for detection brought by the extremely small proportion of target pixels, significant scale variations of objects, and complex background information in UAV images have greatly limited the practical applications of UAV. To address these challenges, we propose a novel object detection network Multi-scale Context Aggregation and Scale-adaptive Fusion YOLO (MASF-YOLO), which is developed based on YOLOv11. Firstly, to tackle the difficulty of detecting small objects in UAV images, we design a Multi-scale Feature Aggregation Module (MFAM), which significantly improves the detection accuracy of small objects through parallel multi-scale convolutions and feature fusion. Secondly, to mitigate the interference of background noise, we propose an Improved Efficient Multi-scale Attention Module (IEMA), which enhances the focus on target regions through feature grouping, parallel sub-networks, and cross-spatial learning. Thirdly, we introduce a Dimension-Aware Selective Integration Module (DASI), which further enhances multi-scale feature fusion capabilities by adaptively weighting and fusing low-dimensional features and high-dimensional features. Finally, we conducted extensive performance evaluations of our proposed method on the VisDrone2019 dataset. Compared to YOLOv11-s, MASFYOLO-s achieves improvements of 4.6% in mAP@0.5 and 3.5% in mAP@0.5:0.95 on the VisDrone2019 validation set. Remarkably, MASF-YOLO-s outperforms YOLOv11-m while requiring only approximately 60% of its parameters and 65% of its computational cost. Furthermore, comparative experiments with state-of-the-art detectors confirm that MASF-YOLO-s maintains a clear competitive advantage in both detection accuracy and model efficiency.