Object Detection Model Selection

Evaluating and Deploying the Most Efficient Real-Time Detector for SUAS

Overview

In our project focused on building an accurate and efficient object detection system for SUAS (Small Unmanned Aerial Systems), we explored and compared several state-of-the-art object detection models. Our goal was to identify a model that could not only deliver high accuracy but also perform fast enough for real-time drone operations. The models we evaluated included traditional CNN-based detectors like Faster R-CNN and SSD, transformer-based models like DETR, and newer vision-language models. However, after detailed testing, it became clear that YOLO (You Only Look Once) offered the best overall performance and was the most suitable for our specific application.

Why Not Basic CNN Models?

Traditional CNN-based models were the earliest tools used in object detection. They typically apply a sliding window over the image and run a CNN classifier on each region. While this method is simple, it is highly inefficient: it handles multiple objects poorly, especially when they are overlapping, small, or located in complex scenes, and it is too slow for real-time use because it processes many redundant regions. In aerial drone imagery, where both speed and accuracy are critical, traditional sliding-window CNNs simply do not meet the requirements.
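To make the inefficiency concrete, here is a minimal sliding-window sketch in Python/PyTorch. This is not code from our project: the window size, stride, threshold, and the stand-in resnet18 classifier are illustrative assumptions.

```python
import torch
import torchvision.transforms.functional as TF
from torchvision.models import resnet18

# Stand-in per-window binary classifier (object vs. background).
# A real pipeline would load trained weights here.
model = resnet18(num_classes=2).eval()

def sliding_window_detect(image, win=128, stride=64, threshold=0.9):
    """Classify every window; keep boxes whose 'object' score passes."""
    _, H, W = image.shape
    boxes = []
    with torch.no_grad():
        for y in range(0, H - win + 1, stride):
            for x in range(0, W - win + 1, stride):
                crop = image[:, y:y + win, x:x + win].unsqueeze(0)
                crop = TF.resize(crop, [224, 224])
                score = model(crop).softmax(dim=1)[0, 1].item()
                if score > threshold:
                    boxes.append((x, y, x + win, y + win, score))
    return boxes

# Even this modest 480x640 frame costs dozens of full forward passes,
# one per window -- the redundancy that rules this approach out.
boxes = sliding_window_detect(torch.rand(3, 480, 640))
print(f"{len(boxes)} raw detections")
```

Every window is a separate forward pass, and most windows overlap heavily, which is exactly why this approach cannot keep up with a live drone feed.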

Comparison with Faster R-CNN and SSD

We then evaluated Faster R-CNN, a two-stage detector. It first uses a Region Proposal Network (RPN) to suggest candidate object regions and then classifies each of them. Faster R-CNN is highly accurate, especially for larger objects in ground-level images, but it is too slow for real-time use. On drones, where every millisecond counts, the latency introduced by its two-step process becomes a serious limitation.
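For reference, a minimal inference sketch using torchvision's off-the-shelf implementation (this assumes torchvision >= 0.13 for the weights argument; the synthetic frame and the 0.5 score cutoff are illustrative, not our evaluation setup):

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Stage 1 lives in model.rpn (region proposals); stage 2 in
# model.roi_heads (per-region classification + box refinement).
model = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

frame = [torch.rand(3, 480, 640)]  # one synthetic RGB frame
with torch.no_grad():
    preds = model(frame)[0]  # dict with "boxes", "labels", "scores"

confident = preds["scores"] > 0.5
print(f"{confident.sum().item()} detections above 0.5 confidence")
```

The two internal stages run sequentially for every frame, which is where the extra latency comes from.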

Next, we looked at SSD (Single Shot MultiBox Detector), a single-stage model like YOLO. SSD is faster than Faster R-CNN and easier to implement. However, in our tests on aerial and synthetic data, SSD often missed small objects, misclassified overlapping ones, and had lower precision overall. Its architecture was not as robust on the small, varied, high-angle targets found in drone imagery.
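A comparable sketch for SSD, again using torchvision's stock model. The input size and threshold are illustrative, and the printed latency depends entirely on the hardware it runs on:

```python
import time
import torch
from torchvision.models.detection import ssd300_vgg16

model = ssd300_vgg16(weights="DEFAULT").eval()
frame = [torch.rand(3, 300, 300)]  # SSD300 is built around 300x300 inputs

with torch.no_grad():
    model(frame)  # warm-up pass so the timing excludes lazy initialization
    start = time.perf_counter()
    preds = model(frame)[0]
    elapsed_ms = (time.perf_counter() - start) * 1000

keep = preds["scores"] > 0.5
print(f"{keep.sum().item()} detections in {elapsed_ms:.0f} ms")
```

The single-pass structure is why SSD beats Faster R-CNN on speed; the fixed low-resolution input is part of why it struggles with the small targets we care about.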

Comparison with DETR and Transformer-Based Models

We also tried DETR (DEtection TRansformer), a modern transformer-based model that treats object detection as a direct set prediction task, eliminating traditional components such as anchor boxes and non-maximum suppression. Although DETR is promising and performs well on large datasets, it has significant drawbacks for our application. It requires very long training times and large amounts of data to converge properly, which is difficult to manage with synthetic datasets. Moreover, the computational load was too high for edge devices such as onboard drone hardware. DETR also showed inconsistent performance, with frequent missed or false detections when trained on our synthetic data.
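For completeness, a minimal DETR inference sketch using the Hugging Face transformers implementation (recent versions expose DetrImageProcessor; the blank test image and 0.7 threshold are placeholders, not our evaluation setup):

```python
import torch
from PIL import Image
from transformers import DetrImageProcessor, DetrForObjectDetection

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50").eval()

image = Image.new("RGB", (640, 480))  # stand-in for a drone frame
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# DETR predicts a fixed set of (class, box) pairs directly -- no anchors,
# no NMS. Post-processing only thresholds the per-query scores.
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(
    outputs, threshold=0.7, target_sizes=target_sizes)[0]
print(results["boxes"].shape, results["scores"].shape)
```

The set-prediction design is elegant, but the transformer backbone behind it is what drove the training time and edge-compute cost past what our drones could afford.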

Similarly, we experimented with fine-tuning large vision-language models for detection tasks. These models are impressive on general computer vision tasks but were unstable and unreliable in our use case. They also required extensive GPU memory and inference time, making them impractical for deployment on drones.

Why YOLO is Better

After testing all major options, YOLO clearly stood out as the most balanced and practical solution for SUAS object detection. Here's why:

Speed: YOLO is extremely fast. Its single-shot architecture allows it to detect multiple objects in real time, processing full images in a single pass. This is perfect for drones where immediate feedback is necessary.

Accuracy: Despite being fast, YOLO still achieves very high accuracy, even on small or overlapping objects. It handles the kinds of challenging conditions often found in aerial images much better than SSD or DETR.

Lightweight and Efficient: YOLO models are relatively lightweight and run efficiently on limited hardware. This makes them ideal for edge devices like onboard drone processors where resources are limited.

Robust with Synthetic Data: YOLO trained very well on our custom synthetic dataset, which we created to simulate realistic SUAS environments. The model adapted quickly to the data and provided consistent and reliable detections, even when tested on real-world images.

Easy to Train and Customize: YOLO is highly customizable. We were able to train it quickly on our dataset, and it supported annotations in YOLO, XML, and JSON formats. This saved significant time in preparing our training pipeline.
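As an illustration of how little code this takes, here is a minimal training-and-inference sketch using the Ultralytics YOLO API. The file names suas_synthetic.yaml and test_frame.jpg are hypothetical placeholders for our dataset config and a test frame, and the epoch count is illustrative:

```python
from ultralytics import YOLO

# Start from a small pretrained checkpoint; "suas_synthetic.yaml" is a
# hypothetical YOLO-format dataset config pointing at images and labels.
model = YOLO("yolov8n.pt")
model.train(data="suas_synthetic.yaml", epochs=100, imgsz=640)

# Single-pass inference: the whole frame is processed at once.
results = model.predict("test_frame.jpg", conf=0.5)
for box in results[0].boxes:
    print(box.xyxy, float(box.conf), int(box.cls))
```

The same trained model object handles training, validation, and real-time prediction, which is a large part of why our pipeline came together so quickly.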
