Abstract: Reliable drone detection is challenging due to limited annotated real-world data, large appearance variability, and the presence of visually similar distractors such as birds. To address these challenges, this paper introduces SimD3, a large-scale, high-fidelity synthetic dataset designed for robust drone detection in complex aerial environments. Unlike existing synthetic drone datasets, SimD3 explicitly models drones with heterogeneous payloads, incorporates multiple bird species as realistic distractors, and leverages diverse Unreal Engine 5 environments with controlled weather, lighting, and flight trajectories captured using a 360° six-camera rig. Using SimD3, we conduct an extensive experimental evaluation within the YOLOv5 detection framework, including an attention-enhanced variant termed YOLOv5m+C3b, in which the standard bottleneck-based C3 blocks are replaced with C3b modules. Models are evaluated on synthetic data, combined synthetic and real data, and multiple unseen real-world benchmarks to assess robustness and generalization. Experimental results show that SimD3 provides effective supervision for small-object drone detection and that YOLOv5m+C3b consistently outperforms the baseline across in-domain and cross-dataset evaluations. These findings highlight the utility of SimD3 for training and benchmarking robust drone detection models under diverse and challenging conditions.
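The abstract does not spell out the internal design of the C3b module; the following PyTorch sketch illustrates one plausible reading, a C3-style block whose bottleneck path is re-weighted by channel attention before concatenation. The `ConvBNSiLU` helper, the squeeze-and-excitation gate, and all hyperparameters are illustrative assumptions, not the paper's actual definition.

```python
# Hypothetical sketch of an attention-enhanced C3-style block ("C3b").
# The attention mechanism shown (squeeze-and-excitation) is an assumption,
# not taken from the paper.
import torch
import torch.nn as nn

class ConvBNSiLU(nn.Module):
    """Standard YOLOv5-style conv -> batch norm -> SiLU unit."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class C3b(nn.Module):
    """C3-style split/transform/merge block whose bottleneck branch is
    gated by channel attention (illustrative design)."""
    def __init__(self, c_in, c_out, n=1, e=0.5, r=16):
        super().__init__()
        c_hidden = int(c_out * e)
        self.cv1 = ConvBNSiLU(c_in, c_hidden)
        self.cv2 = ConvBNSiLU(c_in, c_hidden)
        self.m = nn.Sequential(*(ConvBNSiLU(c_hidden, c_hidden, k=3) for _ in range(n)))
        self.se = nn.Sequential(  # channel attention gate (assumption)
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(c_hidden, c_hidden // r, 1), nn.SiLU(),
            nn.Conv2d(c_hidden // r, c_hidden, 1), nn.Sigmoid())
        self.cv3 = ConvBNSiLU(2 * c_hidden, c_out)

    def forward(self, x):
        y = self.m(self.cv1(x))
        y = y * self.se(y)  # re-weight channels on the bottleneck path
        return self.cv3(torch.cat((y, self.cv2(x)), dim=1))

if __name__ == "__main__":
    block = C3b(128, 128, n=2)
    print(block(torch.randn(1, 128, 40, 40)).shape)  # torch.Size([1, 128, 40, 40])
```

Dropping such a block in place of YOLOv5's C3 modules preserves input/output shapes, so the swap requires no other changes to the network topology.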
Abstract: Unmanned Aerial Vehicles (UAVs), commonly known as drones, pose increasing risks in civilian and defense settings, demanding accurate, real-time drone detection systems. However, detecting drones is challenging because of their small size, rapid movement, and low visual contrast. A modified Yolov5 architecture, Yolov5-CBi, is proposed that incorporates the Convolutional Block Attention Module (CBAM) and the Bidirectional Feature Pyramid Network (BiFPN) to improve sensitivity to small objects. A curated training dataset of 28K images covering various flying objects is created, and a local test dataset of 2,500 images containing very small drone objects is collected. The proposed architecture is evaluated on four benchmark datasets as well as the local test dataset. Both the baseline Yolov5 and the proposed Yolov5-CBi architecture outperform newer Yolo versions, including Yolov8 and Yolov12, in the speed-accuracy trade-off for small-object detection. Four further variants of the CBi architecture, which differ in the placement and usage of CBAM and BiFPN, are also proposed and evaluated. These variants are then compressed for edge deployment using knowledge distillation, with a Yolov5m-CBi teacher and a Yolov5n-CBi student. The distilled model achieves an mAP@0.5:0.95 of 0.6573, a 6.51% improvement over the teacher's score of 0.6171, highlighting the effectiveness of the distillation process. The distilled model is also 82.9% faster than the baseline, making it better suited to real-time drone detection. These findings highlight the effectiveness of the proposed CBi architecture, together with the distilled lightweight models, in advancing efficient and accurate real-time detection of small UAVs.
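CBAM itself is a published module (Woo et al., 2018), so it can be sketched concretely; the PyTorch rendering below shows the standard channel-then-spatial attention sequence. Where the CBi variants insert CBAM in the Yolov5 backbone and neck, and how BiFPN replaces the default neck, are design details of the paper that this sketch does not attempt to reproduce.

```python
# Minimal CBAM (Convolutional Block Attention Module) sketch following
# the standard formulation: channel attention, then spatial attention.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Shared MLP over global average- and max-pooled descriptors."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False))

    def forward(self, x):
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        return torch.sigmoid(avg + mx)

class SpatialAttention(nn.Module):
    """7x7 conv over channel-wise average and max maps."""
    def __init__(self, k=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, k, padding=k // 2, bias=False)

    def forward(self, x):
        avg = torch.mean(x, dim=1, keepdim=True)
        mx, _ = torch.max(x, dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAM(nn.Module):
    """Channel attention followed by spatial attention (Woo et al., 2018)."""
    def __init__(self, channels, reduction=16, k=7):
        super().__init__()
        self.ca = ChannelAttention(channels, reduction)
        self.sa = SpatialAttention(k)

    def forward(self, x):
        x = x * self.ca(x)
        return x * self.sa(x)
```

Because CBAM is shape-preserving, it can in principle be placed after any backbone or neck stage; the paper's variants evidently differ exactly in this placement choice.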

Abstract: Accurate camera models are essential for photogrammetry applications such as 3D mapping and object localization, particularly at long distances. Various stereo-camera-based 3D localization methods exist, but their range is limited to a few hundred meters. This is mainly due to the limitations of the distortion models assumed for the non-linearities present in the camera lens. This paper presents a framework for constructing a distortion model suitable for localizing objects at longer distances. While neural networks are, in principle, well suited to modeling a highly complex non-linear lens distortion function, it is observed that directly applying neural networks to distortion modeling fails to converge when estimating the camera parameters. To resolve this, a hybrid approach is presented in which conventional distortion models are first extended with higher-order terms and then enhanced with a neural-network-based residual correction model. This hybrid approach substantially improves long-range localization performance and can estimate the 3D position of objects at distances of up to 5 kilometres. The estimated 3D coordinates are transformed to GIS coordinates and plotted on a GIS map for visualization. Experimental validation demonstrates the robustness and effectiveness of the proposed framework, offering a practical solution for calibrating CCTV cameras for long-range photogrammetry applications.
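As a rough illustration of the hybrid approach, the sketch below combines a Brown-Conrady-style distortion model extended with higher-order radial terms and a small MLP that learns a residual correction on top of it. The coefficient count, network width, and residual scaling are assumptions for illustration, not the paper's exact configuration.

```python
# Hedged sketch of a hybrid lens-distortion model: a parametric
# Brown-Conrady model with extended radial terms, plus a learned
# neural-network residual. All sizes/coefficients are illustrative.
import torch
import torch.nn as nn

def radial_tangential(xy, k, p):
    """Brown-Conrady distortion with radial terms extended to r^8.
    xy: (N, 2) normalized image coordinates; k: 4 radial coeffs; p: 2 tangential coeffs."""
    x, y = xy[:, 0], xy[:, 1]
    r2 = x * x + y * y
    radial = 1 + k[0] * r2 + k[1] * r2**2 + k[2] * r2**3 + k[3] * r2**4
    x_d = x * radial + 2 * p[0] * x * y + p[1] * (r2 + 2 * x * x)
    y_d = y * radial + p[0] * (r2 + 2 * y * y) + 2 * p[1] * x * y
    return torch.stack([x_d, y_d], dim=1)

class HybridDistortion(nn.Module):
    """Parametric distortion plus a small MLP residual correction."""
    def __init__(self):
        super().__init__()
        self.k = nn.Parameter(torch.zeros(4))   # radial coefficients
        self.p = nn.Parameter(torch.zeros(2))   # tangential coefficients
        self.residual = nn.Sequential(          # NN residual correction (assumed size)
            nn.Linear(2, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
            nn.Linear(64, 2))

    def forward(self, xy):
        base = radial_tangential(xy, self.k, self.p)
        return base + 1e-3 * self.residual(xy)  # small correction on top of the parametric model
```

Initializing the parametric coefficients at zero makes the model start from the identity mapping, so the MLP only has to learn what the extended polynomial cannot express; this staged structure is one plausible reason the hybrid converges where a purely neural distortion model does not.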