[ Teaching ]  About Object Detection Part I: YOLO (v1~v3)

  By: Leadtek AI Expert

Object detection is generally implemented in one of two ways: one-stage detection or two-stage detection. The common methods are as follows:

One-stage object detection: YOLO (v1~v3)
Two-stage object detection: R-CNN (or Fast R-CNN and Faster R-CNN)


Feature extraction in YOLOv1 is done by GoogLeNet; the features then pass through fully-connected (FC) layers and are finally reshaped into a 7x7x30 output.

How does YOLOv1 predict? Each element of the 7x7 grid (called a grid cell) predicts two differently shaped boxes, and the 30 channels come from 2*(4+1)+20=30, which is (number of bounding boxes) * (the coordinates of the box center (x, y) and the width and height (w, h), plus the confidence that the box contains an object) + (number of categories). With (4+1) unchanged, increasing or decreasing the number of categories changes the overall output size.
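The channel arithmetic above can be sketched directly; the values (2 boxes, 20 categories) are the ones given in the text:

```python
# How the 7x7x30 YOLOv1 output decomposes.
S = 7                         # grid size (S x S cells)
B = 2                         # boxes predicted per grid cell
C = 20                       # number of categories (e.g. PASCAL VOC)

channels = B * (4 + 1) + C    # (x, y, w, h) + confidence per box, plus class probs
print(channels)               # 30
print((S, S, channels))       # (7, 7, 30)
```

Changing C (the number of categories) changes `channels`, and hence the overall output size, exactly as the paragraph above describes.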

The confidence predicted by YOLOv1 reflects both the probability that an object is in the box and the IOU between the predicted box and the ground-truth box.

When predicting the probability of a category, the probability represents "the probability of this category given that an object is present", and YOLOv1 multiplies it by the confidence when making predictions.

The result is "the probability of this category" multiplied by the IOU, which serves as a class-specific reference score; Non-maximum suppression (NMS) is then performed per category, and the final output is obtained.
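A minimal sketch of this multiplication for a single grid cell, with made-up numbers (the shapes follow the 20-category setup above, but the values are purely illustrative):

```python
import numpy as np

# Toy example: class-specific scores for one YOLOv1 box.
rng = np.random.default_rng(0)
class_probs = rng.random(20)        # Pr(class | object), 20 categories
class_probs /= class_probs.sum()    # normalize to a probability distribution
box_confidence = 0.8                # Pr(object) * IOU, predicted per box

# Class-specific confidence used to rank boxes before NMS:
class_scores = class_probs * box_confidence
print(class_scores.argmax())        # index of the highest-scoring class
```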

What is IOU (Intersection over Union)?

A method of measuring the similarity between a predicted box and a ground truth box. The IOU value is between 0 and 1, and the higher the value, the more similar the position and size of the two boxes are.
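The definition above translates directly into code. This is a straightforward sketch for boxes given as corner coordinates (the corner format is a choice made here for simplicity):

```python
def iou(box_a, box_b):
    """IOU of two axis-aligned boxes given as (x1, y1, x2, y2) corners."""
    # Intersection rectangle: overlap of the two boxes (may be empty).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7: intersection 1, union 4+4-1
```

Identical boxes give 1.0, disjoint boxes give 0.0, matching the 0-to-1 range described above.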

What is NMS (Non-maximum suppression)?

When many grid cells predict an object of the same category, outputting the results without filtering would leave many overlapping prediction boxes with similar positions and sizes. NMS therefore, according to each grid cell's predicted confidence and corresponding category probability, finds the prediction box with the highest score and keeps it. For each remaining lower-scoring box, the IOU with the kept box is computed and compared against a threshold (e.g. 0.5): if the IOU is greater than the threshold, the lower-scoring box is discarded; if it is less, the box is retained. Finally, the surviving boxes are output.
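The greedy procedure just described can be sketched as follows (the boxes and scores in the usage example are invented for illustration):

```python
def _iou(a, b):
    """IOU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy per-class NMS: keep the highest-scoring box, drop overlaps, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Discard any remaining box that overlaps the kept box too much.
        order = [i for i in order if _iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: box 1 overlaps box 0 heavily and is suppressed
```

In a full detector this would be run once per category, as the text notes.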

YOLOv1 Defect:

If two objects have centers that fall into the same grid cell, YOLOv1 cannot predict both objects and their categories at the same time. Only the category with the highest probability can be read off from the category predictions, and the grid cell is treated as containing only that category.


To address this defect of YOLOv1, YOLOv2 predicts category values in each prediction box of a different size, so that one grid cell can output prediction boxes of more than one category. Whereas YOLOv1 uses GoogLeNet, YOLOv2 uses DarkNet-19 instead and makes several further improvements.

YOLOv2 adds batch normalization to all convolutional layers to mitigate the vanishing-gradient problem. One more stage is also added during model training, so the YOLOv2 model only has to learn how to detect objects, whereas YOLOv1 had to adapt to the change of resolution and learn to identify objects at the same time. This is one of YOLOv2's improvements and increases final performance.

In addition, YOLOv1 directly outputs the position and size of the prediction box. YOLOv2 borrows the anchor-box concept from Faster R-CNN and instead outputs only the translation and scaling of an anchor box. Unlike Faster R-CNN, YOLOv2 uses the k-means clustering algorithm to define the sizes of the anchor boxes. The purpose is to make the IOU between the prediction box and the ground-truth box as high as possible, so the algorithm replaces Euclidean distance with an IOU-based distance as the similarity measure.
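A sketch of this IOU-based clustering, under the common convention that box shapes are compared by width and height only (centers aligned); the data and function names here are illustrative, not from YOLOv2's implementation:

```python
import random

def wh_iou(wh_a, wh_b):
    """IOU of two boxes compared by (width, height) only, centers aligned."""
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter
    return inter / union

def kmeans_anchors(boxes_wh, k, iters=50, seed=0):
    """k-means over box shapes with distance = 1 - IOU instead of Euclidean."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes_wh, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for wh in boxes_wh:
            # Assign each box to the centroid with the highest IOU
            # (equivalently, the lowest 1 - IOU distance).
            best = max(range(k), key=lambda i: wh_iou(wh, centroids[i]))
            clusters[best].append(wh)
        centroids = [
            (sum(w for w, _ in c) / len(c), sum(h for _, h in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids

anchors = kmeans_anchors([(10, 10), (12, 14), (11, 12), (50, 60), (55, 58)], k=2)
print(anchors)  # two centroid shapes: one small, one large
```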

YOLOv2 adds a pass-through layer that attaches high-resolution feature maps to low-resolution feature maps (concatenating along the channel dimension) so that information with more precise locations can be preserved.
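The core idea is a space-to-depth rearrangement: spatial blocks of the high-resolution map are folded into channels so the two maps match spatially and can be concatenated. The shapes below are illustrative (the real network also applies a channel-reducing convolution first):

```python
import numpy as np

def space_to_depth(x, block=2):
    """(H, W, C) -> (H/block, W/block, C*block*block)."""
    h, w, c = x.shape
    x = x.reshape(h // block, block, w // block, block, c)
    x = x.transpose(0, 2, 1, 3, 4)   # group each block's pixels together
    return x.reshape(h // block, w // block, c * block * block)

high_res = np.zeros((26, 26, 512))   # earlier, finer feature map
low_res = np.zeros((13, 13, 1024))   # later, coarser feature map
merged = np.concatenate([space_to_depth(high_res), low_res], axis=-1)
print(merged.shape)  # (13, 13, 3072): fine location detail kept as channels
```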

For object detection, we want predictions to remain accurate when images of different sizes are input. Therefore, after every 10 training batches, the size of the input image is randomly changed. The new size is 416 ± a multiple of 32, and the size of the output feature map varies with the size of the input image.
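This multi-scale schedule can be sketched as follows; the specific range 320-608 follows the usual multiples-of-32 convention around 416 and is an assumption for illustration:

```python
import random

# Candidate input sizes: multiples of 32 around 416.
rng = random.Random(0)
sizes = [320 + 32 * i for i in range(10)]   # 320, 352, ..., 608

size = 416
for batch in range(1, 31):
    if batch % 10 == 0:                      # re-draw the size every 10 batches
        size = rng.choice(sizes)
    grid = size // 32                        # output feature-map size shrinks by 32x
    # ... train this batch on images resized to (size, size) ...
```

Because the network downsamples by a factor of 32, every candidate size yields a whole-number output grid.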


YOLOv3 uses DarkNet-53 instead. The network contains residual blocks, in which a shortcut adds the input directly to the signal passed through the weight layers and activation function. The main function of the residual block is to increase the flow of information and to make it easier for the stacked nonlinear convolution layers to form an identity mapping.
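A minimal numeric sketch of the residual idea, with a random linear map standing in for DarkNet-53's convolutions (an assumption made to keep the example self-contained):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(8)                 # input features
W = rng.standard_normal((8, 8)) * 0.01     # near-zero weights, stand-in for F(x)

def leaky_relu(v, alpha=0.1):
    return np.where(v > 0, v, alpha * v)

# Residual block: the shortcut adds the input back to the transformed signal.
y = x + leaky_relu(W @ x)

# With near-zero weights the whole block stays close to an identity mapping,
# which is exactly what makes residual blocks easy to optimize.
print(np.abs(y - x).max())
```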

Another improvement concerns the feature pyramid. The most classic and traditional method is the single feature map, where the image passes through the convolutional layers and a single output is produced. In a Feature Pyramid Network (FPN), the prediction at each scale integrates feature-map information from different scales, so the predictions draw on richer image semantics and spatial features at every scale, and thus better results are obtained.

The number of anchor boxes in YOLOv3 has also been increased, from 5 in YOLOv2 to 9. Each of the last three output layers, at three different scales, predicts with 3 anchor boxes to increase prediction accuracy.
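The resulting output shapes can be sketched for a 416x416 input; the 80-class count is a COCO-style assumption for illustration:

```python
# YOLOv3's three detection scales for a 416x416 input.
num_anchors_per_scale = 3            # 9 anchors total, 3 per scale
num_classes = 80                     # assumed COCO-style class count
channels = num_anchors_per_scale * (4 + 1 + num_classes)   # box + conf + classes

for stride in (32, 16, 8):           # coarse, medium, and fine scales
    grid = 416 // stride
    print((grid, grid, channels))    # (13, 13, 255), (26, 26, 255), (52, 52, 255)
```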