There are a number of machine learning approaches to object detection tasks. Examples include the Viola-Jones framework8 and the histogram of oriented gradients.9 Recent object detection research and development, however, has focused largely on convolutional neural networks (CNNs). As such, this page focuses on the two types of CNNs most discussed in object detection research. Note that these models are tested and compared using benchmark datasets, such as the Microsoft COCO dataset or ImageNet.
R-CNN (region-based convolutional neural network) is a two-stage detector that uses a method called region proposals to generate roughly 2,000 candidate regions per image. R-CNN then warps the extracted regions to a uniform size and runs those regions through separate networks for feature extraction and classification. Each region is ranked according to the confidence of its classification. R-CNN then rejects any region whose intersection-over-union (IoU) overlap with a higher-scoring selected region exceeds a set threshold. The remaining top-ranking classified regions are the model’s output.10 As one might expect, this architecture is computationally expensive and slow. Fast R-CNN and Faster R-CNN are later modifications that reduce the size of R-CNN’s architecture and thereby decrease processing time while also increasing accuracy.11
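The overlap-based rejection step described above is commonly called non-maximum suppression. The following is a minimal sketch of the idea in plain Python, not R-CNN's actual implementation: the function names, the `(x_min, y_min, x_max, y_max)` box format, and the default threshold of 0.5 are all assumptions made for illustration, and in practice the step is usually applied separately for each object class.

```python
from typing import List, Tuple

# Assumed box format for this sketch: (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the intersection rectangle (may be empty)
    ix_min = max(a[0], b[0])
    iy_min = max(a[1], b[1])
    ix_max = min(a[2], b[2])
    iy_max = min(a[3], b[3])

    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(
    boxes: List[Box],
    scores: List[float],
    iou_threshold: float = 0.5,  # illustrative value, not taken from the source
) -> List[int]:
    """Return indices of the boxes to keep, highest score first."""
    # Visit boxes from most to least confident
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    scores = [0.9, 0.8, 0.7]
    print(non_max_suppression(boxes, scores))
```

In the usage example, the second box overlaps heavily with the higher-scoring first box and is dropped, while the third box does not overlap with anything and is kept.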
YOLO (You Only Look Once) is a family of single-stage detection architectures built on Darknet, an open-source CNN framework. First developed in 2016, the YOLO architecture prioritizes speed. Indeed, YOLO’s speed makes it preferable for real-time object detection and has earned it the common descriptor of state-of-the-art object detector. YOLO differs from R-CNN in several ways. First, while R-CNN passes extracted image regions through multiple networks that separately extract features and classify images, YOLO condenses these actions into a single network. Second, compared to R-CNN’s ~2,000 region proposals, YOLO makes fewer than 100 bounding box predictions per image. In addition to being faster than R-CNN, YOLO also produces fewer background false positives, although it has a higher localization error.12 There have been many updates to YOLO since its inception, generally focusing on speed and accuracy.13
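The count of fewer than 100 predictions follows from how a grid-based single-stage detector lays out its output. As a rough sketch, assuming the configuration reported for the original YOLO model (a 7×7 grid, 2 boxes per grid cell, and 20 object classes; later versions use different settings), the arithmetic looks like this. The function names are hypothetical and chosen only for illustration.

```python
from typing import Tuple


def total_boxes(grid_size: int = 7, boxes_per_cell: int = 2) -> int:
    """Number of bounding boxes predicted per image."""
    # Every grid cell predicts the same fixed number of boxes
    return grid_size * grid_size * boxes_per_cell


def output_shape(
    grid_size: int = 7,
    boxes_per_cell: int = 2,
    num_classes: int = 20,
) -> Tuple[int, int, int]:
    """Shape of the single output tensor produced in one forward pass."""
    # Each box carries 5 values: x, y, width, height, confidence
    values_per_box = 5
    # Class probabilities are shared by all boxes in a cell
    depth = boxes_per_cell * values_per_box + num_classes
    return (grid_size, grid_size, depth)


if __name__ == "__main__":
    print(total_boxes())
    print(output_shape())
```

Under these assumed settings the network predicts 98 boxes per image in a single pass, compared with roughly 2,000 regions that R-CNN classifies one by one.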
Though originally developed for object detection, later versions of R-CNN and YOLO can also train classification and segmentation models. Specifically, Mask R-CNN combines object detection and segmentation in one model, while YOLOv5 can train separate classification, detection, and segmentation models.
Of course, there are many other model architectures beyond R-CNN and YOLO. SSD and RetinaNet are two additional models that use a simplified architecture similar to YOLO.14 DETR is another architecture, developed by Facebook (now Meta), that combines a CNN with a transformer model and shows performance comparable to Faster R-CNN.15