About Object Detection Part II: R-CNN Family

By: Leadtek AI Expert

Remember last time, when we talked about the one-stage object detection method YOLO (v1~v3)?

Let's do a quick recap here. 

Object detection is generally implemented in one of two ways. In terms of procedure, it can be divided into one-stage object detection and two-stage object detection. The common methods are as follows:

One-stage object detection: YOLO (v1~v3)

Two-stage object detection: R-CNN, Fast R-CNN, and Faster R-CNN

We will introduce the R-CNN family in this article.


After an image is input to R-CNN (Region-based CNN), the Region Proposals stage finds the regions where objects may be present. A feature extractor then extracts features from each region, and these features are used to confirm whether the region contains an object, to refine its location, and finally to output the classification.


The method R-CNN uses in the Region Proposals stage is Selective Search (SS). Selective Search first divides the original picture into small regions, calculates the similarity between each region and its adjacent regions, and merges the two most similar regions into one group. This is repeated until the whole picture forms a single group, and the regions produced along the way serve as Region Proposals. This algorithm is called the Hierarchical Grouping Algorithm.
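
To make the grouping idea concrete, here is a minimal sketch of greedy hierarchical merging. It is a toy version, not the real Selective Search: the function name `hierarchical_grouping`, the region format (mean intensity, pixel count), and the similarity measure (difference of mean intensity) are all assumptions chosen for illustration, whereas the real algorithm combines color, texture, size, and fill similarities.

```python
def hierarchical_grouping(regions, neighbors):
    """Greedily merge the most similar adjacent regions until one is left.

    regions:   {region_id: (mean_intensity, pixel_count)}  (toy format, an assumption)
    neighbors: set of frozenset({id_a, id_b}) pairs that touch each other
    Returns the merge history as a list of (id_a, id_b, new_id).
    """
    regions = dict(regions)
    neighbors = set(neighbors)
    history = []
    next_id = max(regions) + 1
    while len(regions) > 1 and neighbors:
        # Toy similarity: the pair with the smallest intensity difference wins.
        a, b = min(
            (tuple(sorted(pair)) for pair in neighbors),
            key=lambda p: (abs(regions[p[0]][0] - regions[p[1]][0]), p),
        )
        (mean_a, n_a), (mean_b, n_b) = regions.pop(a), regions.pop(b)
        # The merged region keeps the size-weighted mean of both parts.
        regions[next_id] = ((mean_a * n_a + mean_b * n_b) / (n_a + n_b), n_a + n_b)
        # Every neighbor of a or b becomes a neighbor of the merged region.
        rewired = set()
        for pair in neighbors:
            if pair == frozenset((a, b)):
                continue
            rewired.add(frozenset(next_id if r in (a, b) else r for r in pair))
        neighbors = rewired
        history.append((a, b, next_id))
        next_id += 1
    return history
```

Every region that appears during merging (the initial ones and each merged one) would be a candidate Region Proposal, which is why the method yields proposals at many scales.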

What are the steps of R-CNN?

R-CNN generally uses the following steps to classify objects.
1. Find 2000 regions (Region Proposals) that may have objects in the original image.
2. Resize the regions to 227x227 (AlexNet) / 224x224 (VGG16) and extract features via CNN.
3. The 4096-dimensional feature vector from the final layer is used to predict the category (SVMs) and the correction to the prediction box (Bbox reg).
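
Step 2 above can be sketched as a crop followed by a resize to a fixed square. This is only an illustration under assumptions: `warp_region` is a hypothetical helper, and it uses plain nearest-neighbor sampling in NumPy rather than whatever interpolation a real pipeline would use.

```python
import numpy as np

def warp_region(image, box, out_size=227):
    """Crop box = (x1, y1, x2, y2) from an H x W x C image and
    stretch it to out_size x out_size, ignoring the aspect ratio."""
    x1, y1, x2, y2 = box
    crop = image[y1:y2, x1:x2]
    h, w = crop.shape[:2]
    # Nearest-neighbor sampling: map each output row/column to a source index.
    rows = np.arange(out_size) * h // out_size
    cols = np.arange(out_size) * w // out_size
    return crop[rows][:, cols]
```

For example, a 200x50 proposal (4:1) is squeezed into a 1:1 square, so the object inside looks very different from the original.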


R-CNN defects:

The biggest drawbacks of this method are that it is slow and that every Region Proposal has to be resized to the same size. The resizing distorts the image, so original information is lost or too much background information is introduced, which decreases accuracy.

Fast R-CNN:

Since every Region Proposal obtained by SS in R-CNN goes through the CNN for feature extraction, 2000 regions mean 2000 feature extraction operations. Many Region Proposals overlap each other, which means the overlapping areas are computed repeatedly during feature extraction. Fast R-CNN avoids this by running the CNN only once on the whole image and mapping each Region Proposal onto the resulting feature map.


Region Proposals mapped onto the feature map are called regions of interest (ROIs). Fast R-CNN resizes all ROIs to 6x6 through ROI pooling, and then uses fully connected layers (FCs) to output the regression values of the prediction box (Linear) and the category probabilities (softmax).
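
The resizing of ROIs to a fixed grid can be sketched as max pooling over a grid of bins. This is a simplified NumPy illustration, not the original implementation: `roi_max_pool` is a hypothetical name, the ROI is assumed to be given in feature-map cells with an exclusive end, and the bin edges are an assumption.

```python
import numpy as np

def roi_max_pool(feature_map, roi, out_size=6):
    """feature_map: C x H x W array; roi = (x1, y1, x2, y2) in feature-map cells.
    Returns a C x out_size x out_size array regardless of the ROI size."""
    x1, y1, x2, y2 = roi
    region = feature_map[:, y1:y2, x1:x2]
    channels, h, w = region.shape
    # Split the ROI height and width into out_size roughly equal bins.
    y_edges = np.linspace(0, h, out_size + 1)
    x_edges = np.linspace(0, w, out_size + 1)
    pooled = np.empty((channels, out_size, out_size), dtype=feature_map.dtype)
    for i in range(out_size):
        r0 = int(np.floor(y_edges[i]))
        r1 = max(int(np.ceil(y_edges[i + 1])), r0 + 1)  # at least one row per bin
        for j in range(out_size):
            c0 = int(np.floor(x_edges[j]))
            c1 = max(int(np.ceil(x_edges[j + 1])), c0 + 1)
            # Keep the strongest response inside the bin, per channel.
            pooled[:, i, j] = region[:, r0:r1, c0:c1].max(axis=(1, 2))
    return pooled
```

Because the output shape is fixed, ROIs of any size can be fed into the same fully connected layers without warping the input image.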


Advantages and disadvantages of Fast R-CNN:

Fast R-CNN reduces the computation time of the R-CNN model because the CNN runs only once per image, and using a single network also simplifies the training process. However, finding the candidate boxes with SS is still very time-consuming. Faster R-CNN was born to solve this problem.

Faster R-CNN:

To improve prediction speed, Faster R-CNN abandoned the SS algorithm and added a Region Proposal Network (RPN) to the CNN architecture to obtain Region Proposals.

The following diagram is a more detailed description of the flow after the feature extractor outputs the feature map.


The RPN is placed after the last convolutional layer, where it replaces the SS algorithm, and it is trained to produce candidate regions. The RPN procedure is shown below.


The anchor box was introduced in Faster R-CNN, so the regression value of the prediction box becomes an offset relative to an anchor box. The anchor box has three custom sizes (8, 16, 32) and three length-to-width ratios (0.5, 1, 2), so the number of anchor boxes at each position is 3 * 3 = 9 = K. In addition, because the feature extractor contains four pooling layers, each cell of the feature map covers 2^4 = 16 pixels of the original image, so the sizes (8, 16, 32) on the feature map correspond to (128, 256, 512) on the original image.
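
The anchor arithmetic above can be checked with a short sketch. The function names are hypothetical, the ratio is assumed to mean height / width, each anchor is assumed to keep the area of the square with the same size, and the values are not rounded the way a real implementation might round them.

```python
import numpy as np

def make_anchors(scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0), stride=16):
    """Return K = len(scales) * len(ratios) anchors (x1, y1, x2, y2) centered at (0, 0)."""
    anchors = []
    for scale in scales:
        side = scale * stride  # 8, 16, 32 on the feature map -> 128, 256, 512 pixels
        for ratio in ratios:   # ratio is taken as height / width (assumption)
            w = side / np.sqrt(ratio)
            h = side * np.sqrt(ratio)
            anchors.append((-w / 2, -h / 2, w / 2, h / 2))
    return np.array(anchors)

def shift_anchors(base_anchors, feat_h, feat_w, stride=16):
    """Place the base anchors at the center of every feature-map cell."""
    xs = (np.arange(feat_w) + 0.5) * stride
    ys = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(xs, ys)
    shifts = np.stack([cx, cy, cx, cy], axis=-1).reshape(-1, 1, 4)
    return (shifts + base_anchors[None]).reshape(-1, 4)
```

With a feature map of feat_h x feat_w cells, this gives feat_h * feat_w * 9 anchor boxes in total.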


About RPN:

In the Region Proposal stage, the RPN applies two convolutional output layers to the feature map. The layer with 2K channels gives, for each anchor box, the probability that it is an object (foreground) and the probability that it is background. Since there are 9 anchor boxes, this layer has 2 * 9 = 18 channels, that is, 18 predicted values per position. The layer with 4K channels gives the offsets (t_x, t_y, t_w, t_h) between each anchor box and the ground-truth box, so it has 4 * 9 = 36 channels, that is, 36 predicted values per position.
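
As an illustration of what the four offsets mean, the sketch below encodes a ground-truth box relative to an anchor box and decodes it back. It assumes the parameterization commonly described for Faster R-CNN (center shifts divided by the anchor size, and the log of the size ratios); the article itself does not spell out the formula, and the helper names are hypothetical.

```python
import numpy as np

def to_center(box):
    """Convert (x1, y1, x2, y2) to (center_x, center_y, width, height)."""
    x1, y1, x2, y2 = box
    return (x1 + x2) / 2.0, (y1 + y2) / 2.0, x2 - x1, y2 - y1

def encode_offsets(anchor, gt_box):
    """Offsets (t_x, t_y, t_w, t_h) that move the anchor onto the ground-truth box."""
    ax, ay, aw, ah = to_center(anchor)
    gx, gy, gw, gh = to_center(gt_box)
    return np.array([(gx - ax) / aw, (gy - ay) / ah,
                     np.log(gw / aw), np.log(gh / ah)])

def decode_offsets(anchor, t):
    """Apply predicted offsets to an anchor and return the box (x1, y1, x2, y2)."""
    ax, ay, aw, ah = to_center(anchor)
    cx, cy = ax + t[0] * aw, ay + t[1] * ah
    w, h = aw * np.exp(t[2]), ah * np.exp(t[3])
    return np.array([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
```

During training, the encoded offsets would be the regression targets; at prediction time, decoding turns the 4K outputs back into boxes.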

How does the RPN define positive and negative samples? After the feature extractor outputs the feature map from the input image, Faster R-CNN sets an anchor point at the center of each element of the feature map, centers the 9 anchor boxes on that anchor point, and calculates the IOU between each anchor box and the ground-truth box.

If the IOU is greater than the upper threshold (0.7), the anchor box is a positive sample; if the IOU is less than the lower threshold (0.3), the anchor box is a negative sample. Since some ground-truth boxes may end up without any anchor box assigned to them, Faster R-CNN also marks the anchor box that has the largest IOU with each ground-truth box as a positive sample.
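
The labeling rule can be sketched in a few lines of NumPy. The function names are hypothetical, and marking the anchors that fall between the two thresholds as -1 (ignored) is an assumption of this sketch.

```python
import numpy as np

def iou_matrix(anchors, gt_boxes):
    """Pairwise IOU between N anchors and M ground-truth boxes, all (x1, y1, x2, y2)."""
    x1 = np.maximum(anchors[:, None, 0], gt_boxes[None, :, 0])
    y1 = np.maximum(anchors[:, None, 1], gt_boxes[None, :, 1])
    x2 = np.minimum(anchors[:, None, 2], gt_boxes[None, :, 2])
    y2 = np.minimum(anchors[:, None, 3], gt_boxes[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
    area_g = (gt_boxes[:, 2] - gt_boxes[:, 0]) * (gt_boxes[:, 3] - gt_boxes[:, 1])
    return inter / (area_a[:, None] + area_g[None, :] - inter)

def label_anchors(anchors, gt_boxes, pos_thr=0.7, neg_thr=0.3):
    """Return 1 (positive), 0 (negative) or -1 (ignored, an assumption) per anchor."""
    iou = iou_matrix(anchors, gt_boxes)
    best = iou.max(axis=1)            # best overlap of each anchor with any ground truth
    labels = np.full(len(anchors), -1)
    labels[best < neg_thr] = 0
    labels[best > pos_thr] = 1
    # Every ground-truth box gets at least its best-matching anchor as a positive.
    labels[iou.argmax(axis=0)] = 1
    return labels
```

In the test case used to check this sketch, an anchor with an IOU of only 0.5 still becomes positive because it is the best match for its ground-truth box.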

About ROI:

The RPN produces its ROIs as follows.
1. Take the anchor boxes and correct them with the predicted box linear regression offsets to get the ROIs.
2. Sort the ROIs by foreground probability from high to low, and keep the first 6000.
3. Discard ROIs that extend beyond the picture boundary or are too small.
4. Apply non-maximum suppression (NMS) to filter out ROIs with similar position and size.
5. Sort the remaining ROIs by foreground probability from high to low, and take the first 300 as the output of the RPN.
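
Steps 2 to 5 can be sketched as follows, assuming step 1 has already produced the corrected boxes. The function names, the NMS threshold of 0.7, and the minimum side of 16 pixels are assumptions for illustration, not values stated in this article.

```python
import numpy as np

def nms(boxes, scores, iou_thr=0.7):
    """Greedy non-maximum suppression; returns kept indices, best score first."""
    order = np.argsort(-scores)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    keep = []
    while order.size > 0:
        i, rest = order[0], order[1:]
        keep.append(int(i))
        iw = np.clip(np.minimum(boxes[i, 2], boxes[rest, 2])
                     - np.maximum(boxes[i, 0], boxes[rest, 0]), 0, None)
        ih = np.clip(np.minimum(boxes[i, 3], boxes[rest, 3])
                     - np.maximum(boxes[i, 1], boxes[rest, 1]), 0, None)
        inter = iw * ih
        iou = inter / (areas[i] + areas[rest] - inter)
        order = rest[iou <= iou_thr]  # drop boxes that overlap box i too much
    return keep

def select_rois(boxes, scores, image_size, pre_nms=6000, post_nms=300,
                min_size=16, iou_thr=0.7):
    """boxes: N x 4 (x1, y1, x2, y2); scores: foreground probability per box."""
    # Step 2: keep the pre_nms boxes with the highest foreground probability.
    order = np.argsort(-scores)[:pre_nms]
    boxes, scores = boxes[order], scores[order]
    # Step 3: drop boxes that leave the image or are too small.
    img_w, img_h = image_size
    inside = ((boxes[:, 0] >= 0) & (boxes[:, 1] >= 0)
              & (boxes[:, 2] <= img_w) & (boxes[:, 3] <= img_h))
    big = ((boxes[:, 2] - boxes[:, 0] >= min_size)
           & (boxes[:, 3] - boxes[:, 1] >= min_size))
    boxes, scores = boxes[inside & big], scores[inside & big]
    # Steps 4 and 5: NMS already returns indices sorted by score, so cut at post_nms.
    keep = np.array(nms(boxes, scores, iou_thr)[:post_nms], dtype=int)
    return boxes[keep], scores[keep]
```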

The ROIs output by the RPN differ in size, so they are all resized to 7x7 (the number of channels remains the same) before being passed to the classifier that follows. Take the bicycle picture below as an example.

The category output layer has (number of categories + 1) values, where the extra one stands for the background. Once we find the category with the highest probability (assume it is a bicycle), the ROI linear regression output at the corresponding position gives the offset by which the ROI should be corrected.
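
A minimal sketch of this last step is shown below. It assumes the regression head outputs one set of four offsets per category and that index 0 is the background; both the layout and the name `classify_roi` are assumptions for illustration.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

def classify_roi(class_logits, box_offsets):
    """class_logits: (num_classes + 1,) with index 0 = background (assumption).
    box_offsets:  (num_classes + 1, 4), one offset set per category.
    Returns the winning category, its probability, and its four offsets."""
    probs = softmax(np.asarray(class_logits, dtype=float))
    category = int(np.argmax(probs))
    return category, float(probs[category]), np.asarray(box_offsets)[category]
```

Only the offsets that belong to the winning category are applied to the ROI; the offsets predicted for the other categories are ignored.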


In summary, from R-CNN to Fast R-CNN to Faster R-CNN, the deep learning-based object detection process has become more and more streamlined, with higher precision and faster speed. The Region Proposal-based R-CNN series is the most important branch of current object detection technology.
