How to Accelerate AI Model Inference on GPU: A Hands-on Introduction to TensorRT, Part I
Previous articles introduced and explained how to build models, but putting a finished model to practical use is a topic of its own.
We will share two posts this month to give you a hands-on introduction to TensorRT.
When image data is fed into a model, it passes through multiple hidden layers with trained weights and finally produces an output. This process is called inference.
Inference time depends on the number of model parameters and can run to a few seconds. If your application has a tight time budget, that is, the time from image input to output must be shorter, can a GPU help meet the requirement? The answer is yes. Just as a GPU shortens time-consuming and complex model training, it can also shorten inference time.
Image source (NVIDIA official website)
TensorRT now supports multiple frameworks. Models from Caffe, TensorFlow, PyTorch, Chainer, and MXNet can be converted into TensorRT models through the Python / C++ API. TensorFlow has also integrated TensorRT into the framework itself. Caffe models can be converted directly with the TensorRT Parser (model parser); models from the other frameworks need to be converted to the ONNX format first and then parsed with the TensorRT Parser.
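As a rough sketch of the ONNX route described above, the function below parses an ONNX file with the TensorRT Python API and builds a serialized engine. It assumes the `tensorrt` Python package from TensorRT 8.x is installed; the function name `build_engine_from_onnx` and the file name `model.onnx` are hypothetical, and the exact calls may differ in other TensorRT versions.

```python
def build_engine_from_onnx(onnx_path, fp16=False):
    """Parse an ONNX model and return a serialized TensorRT engine (sketch, TensorRT 8.x assumed)."""
    # Imported lazily so this file can be loaded on machines without TensorRT.
    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)

    # The ONNX parser in TensorRT 8.x expects an explicit-batch network;
    # newer releases may not need this flag.
    flags = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    network = builder.create_network(flags)

    parser = trt.OnnxParser(network, logger)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            errors = [str(parser.get_error(i)) for i in range(parser.num_errors)]
            raise RuntimeError("ONNX parsing failed: " + "; ".join(errors))

    config = builder.create_builder_config()
    if fp16:
        # Allow half-precision kernels where the GPU supports them.
        config.set_flag(trt.BuilderFlag.FP16)

    serialized_engine = builder.build_serialized_network(network, config)
    if serialized_engine is None:
        raise RuntimeError("TensorRT failed to build an engine")
    return serialized_engine
```

Calling `build_engine_from_onnx("model.onnx", fp16=True)` would return the serialized engine, which can then be saved and loaded later for inference.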
Five things to note about inference
- Improve throughput: how much output is produced in a given time. Here it is measured in inferences/second or samples/second, which is vital to the efficient operation of each machine.
- Improve computing efficiency: when the hardware specifications of each server are fixed, make the software run more efficiently in order to increase throughput.
- Reduce latency: inference time is usually measured in milliseconds. Low latency not only reduces the overall prediction time but also makes the software feel smoother.
- Maintain accuracy: when a trained neural network is used for inference, it should still predict correctly even after precision is reduced or channel pruning is applied.
- Reduce GPU memory usage: when a neural network model is deployed, a model that is too large occupies too much memory, so it is important to reduce the memory the model takes up.
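To make the first and third items in the list above concrete, here is a minimal sketch of how latency and throughput could be measured. The `benchmark` helper and the `dummy_infer` function are hypothetical stand-ins, not part of TensorRT; a real model call would replace `dummy_infer`.

```python
import time

def benchmark(infer, batch, batch_size, warmup=3, runs=20):
    """Time a callable `infer(batch)` and report mean latency and throughput."""
    # Warm-up runs are excluded so one-time setup costs do not skew the timing.
    for _ in range(warmup):
        infer(batch)

    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        infer(batch)
        timings.append(time.perf_counter() - start)

    mean_latency = sum(timings) / len(timings)
    return {
        "latency_ms": mean_latency * 1000.0,        # time per batch
        "throughput": batch_size / mean_latency,    # samples per second
    }

def dummy_infer(batch):
    # Hypothetical stand-in for a real model: sums each sample.
    return [sum(sample) for sample in batch]

batch = [[0.1] * 1000 for _ in range(8)]
stats = benchmark(dummy_infer, batch, batch_size=len(batch))
print(f"latency: {stats['latency_ms']:.3f} ms/batch")
print(f"throughput: {stats['throughput']:.1f} samples/second")
```

When timing real GPU inference, the GPU work has to be synchronized before the timer is stopped, because GPU kernels are launched asynchronously.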
If any of these five items is neglected during inference, computing efficiency or accuracy may suffer because of insufficient hardware capability or a fully loaded system. Effective inference tools are therefore very important for an application. TensorRT addresses all five points. In addition to the first three, TensorRT can run inference while maintaining almost the same accuracy (only about a 0.2% reduction), and it supports half-precision floating-point (FP16) or 8-bit integer (INT8) computation on Quadro RTX GPUs, which can significantly reduce memory usage. The following explains how TensorRT optimizes a model, using GoogLeNet's Inception layer as the example.
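The memory saving from lower precision follows from simple arithmetic on the bytes stored per weight. The sketch below estimates weight storage only; the parameter count is a hypothetical round number chosen for illustration, not a measured figure for any model.

```python
# Bytes needed to store one value at each precision.
BYTES_PER_VALUE = {"FP32": 4, "FP16": 2, "INT8": 1}

def weight_memory_mib(num_params, precision):
    """Approximate memory used by the weights alone, in MiB."""
    return num_params * BYTES_PER_VALUE[precision] / (1024 ** 2)

NUM_PARAMS = 7_000_000  # hypothetical parameter count, for illustration only

for precision in BYTES_PER_VALUE:
    print(f"{precision}: {weight_memory_mib(NUM_PARAMS, precision):.1f} MiB")
```

This counts weights only. Activations and workspace memory also take GPU memory at run time, so actual usage will be higher.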
The Inception architecture in the original GoogLeNet
TensorRT merges Inception
A convolution block consists of three layers: conv, bias, and ReLU. TensorRT combines the three into one (shown as CBR in the figure above) and executes them with a single kernel. It is like buying three items that used to be checked out separately and having TensorRT check them out together.
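The idea behind the CBR merge can be sketched in plain Python. This is only an illustration of the arithmetic, not how TensorRT implements it: the unfused path makes three passes and creates two intermediate feature maps, while the hypothetical `fused_cbr` function produces the same result in one pass. A 1x1 convolution is used to keep the example short, and all names and values are made up.

```python
def conv1x1(pixels, weights):
    """1x1 convolution: each output channel is a weighted sum of the input channels."""
    return [[sum(w * x for w, x in zip(row, px)) for row in weights] for px in pixels]

def add_bias(pixels, bias):
    return [[v + b for v, b in zip(px, bias)] for px in pixels]

def relu(pixels):
    return [[max(0.0, v) for v in px] for px in pixels]

def fused_cbr(pixels, weights, bias):
    """Conv + bias + ReLU in a single pass, with no intermediate feature maps."""
    return [
        [max(0.0, sum(w * x for w, x in zip(row, px)) + b)
         for row, b in zip(weights, bias)]
        for px in pixels
    ]

# Two pixels with two input channels each; three output channels.
pixels = [[1.0, -2.0], [0.5, 3.0]]
weights = [[1.0, 0.5], [-1.0, 2.0], [0.0, -1.0]]
bias = [0.5, -1.0, 0.25]

unfused = relu(add_bias(conv1x1(pixels, weights), bias))  # three separate passes
fused = fused_cbr(pixels, weights, bias)                  # one pass
print(fused)  # [[0.5, 0.0, 2.25], [2.5, 4.5, 0.0]]
```

On a GPU the benefit comes from launching one kernel instead of three and from not writing the intermediate results back to memory.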
TensorRT merges 1x1 CBR
TensorRT can also recognize hidden layers that have different weights but share the same input data and filter size, such as the 1x1 convs. After each is combined into a 1x1 CBR, they are merged horizontally into a single wider kernel for execution.
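The horizontal merge can be illustrated with a small sketch: two 1x1 convolutions that read the same input are replaced by one wider convolution whose filters are the two filter sets stacked together. The code is plain Python with made-up weights, and it only shows that the results match, not how TensorRT performs the merge internally.

```python
def conv1x1(pixels, weights):
    """1x1 convolution: each output channel is a weighted sum of the input channels."""
    return [[sum(w * x for w, x in zip(row, px)) for row in weights] for px in pixels]

pixels = [[1.0, 2.0], [3.0, -1.0]]     # shared input: two pixels, two channels
weights_a = [[1.0, 0.0], [0.5, 0.5]]   # branch A: two output channels
weights_b = [[-1.0, 1.0]]              # branch B: one output channel

# Separate execution: one pass over the input per branch.
out_a = conv1x1(pixels, weights_a)
out_b = conv1x1(pixels, weights_b)

# Horizontal merge: stack the filters and run a single, wider convolution.
merged = conv1x1(pixels, weights_a + weights_b)

# Split the wide output back into the per-branch results.
split = len(weights_a)
merged_a = [px[:split] for px in merged]
merged_b = [px[split:] for px in merged]
print(merged)  # [[1.0, 1.5, 1.0], [3.0, 1.0, -4.0]]
```

The merged version reads the input once instead of once per branch, which is where the saving comes from.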
TensorRT optimizes network architecture and cancels concat layer
TensorRT can then connect each output directly to the place where it is needed, so no concat operation is required and that layer can be removed as well. We can also see that the two boxes do not depend on each other, so they can be run as two separate computation streams. The figure above is the result of TensorRT's optimization of the Inception layer network. This process is called layer fusion. It effectively reduces the number of layers in the network, which also means less work at inference time, mainly fewer kernel launches and fewer reads and writes of intermediate results.
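One way to picture the removal of the concat layer is to let each branch write its result straight into its own slice of a preallocated output buffer. The sketch below is a plain-Python illustration under that assumption; `run_branch` and the weights are hypothetical, and TensorRT's actual buffer management is not shown in this article.

```python
def run_branch(pixels, weights, out, channel_offset):
    """Compute a 1x1 conv branch and write it directly into `out` at its channel offset."""
    for i, px in enumerate(pixels):
        for j, row in enumerate(weights):
            out[i][channel_offset + j] = sum(w * x for w, x in zip(row, px))

pixels = [[1.0, 2.0], [3.0, -1.0]]
branches = [
    [[1.0, 0.0], [0.5, 0.5]],  # branch A: two output channels
    [[-1.0, 1.0]],             # branch B: one output channel
]

# Preallocate the final output once; each branch fills in its own channels.
total_channels = sum(len(weights) for weights in branches)
out = [[0.0] * total_channels for _ in pixels]

offset = 0
for weights in branches:
    run_branch(pixels, weights, out, offset)
    offset += len(weights)

print(out)  # [[1.0, 1.5, 1.0], [3.0, 1.0, -4.0]]
```

Because every branch already writes to its final location, no separate step is needed to copy the branch outputs into one tensor.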
Layers after fusion
The performance of the model after TensorRT optimization is shown in the figure below. According to the data NVIDIA published on GitHub, TensorRT is about 1.5 to 4 times faster. The model tested is BERT (Bidirectional Encoder Representations from Transformers), a language model commonly used in NLP (Natural Language Processing), converted from TensorFlow with TensorRT.