[ENG] How to Accelerate AI Model Inference on GPU: A Hands-on Introduction to TensorRT, Part I
This is the first of two posts that give you a hands-on introduction to TensorRT.
When image data is fed into a trained model, it passes through multiple hidden layers, whose weights were optimized during training, and finally produces an output. This process is called inference.
Inference time depends on the number of model parameters and can reach several seconds. If your application has a tight time budget, that is, the time from image input to output must be short, can a GPU help meet that requirement? The answer is yes. GPUs are already used to shorten time-consuming, complex model training, and they can shorten inference time as well.
Image source (NVIDIA official website)
Data scientists can use the cuDNN library and related GPU libraries such as cuBLAS, cuSPARSE, or cuFFT to accelerate model training on the GPU in different deep learning frameworks. For inference, this article introduces TensorRT, an SDK also developed by NVIDIA. TensorRT's main approach is precision conversion: the model runs in single-precision floating point (FP32), half-precision floating point (FP16), or 8-bit integer (INT8), which improves latency, throughput, and efficiency.
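To make the idea of precision conversion concrete, here is a minimal sketch in plain Python. It is not TensorRT code: the helper names (`to_fp16`, `quantize_int8`, `dequantize_int8`) are made up for illustration, and the INT8 part assumes a simple symmetric max-based scale, whereas TensorRT chooses its scales through a calibration step.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def quantize_int8(values):
    """Symmetric per-tensor quantization: map [-max_abs, max_abs] onto [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize_int8(q, scale):
    """Map the 8-bit integers back to approximate floating-point values."""
    return [v * scale for v in q]

# Hypothetical weights, chosen only for illustration.
weights = [0.1234567, -1.5, 0.75, 0.0003, 1.0]
fp16 = [to_fp16(w) for w in weights]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

for w, h, r in zip(weights, fp16, restored):
    print(f"original={w:+.7f}  fp16={h:+.7f}  int8 round trip={r:+.7f}")

# Storage per value: 4 bytes (FP32), 2 bytes (FP16), 1 byte (INT8).
sizes = (struct.calcsize("<f"), struct.calcsize("<e"), struct.calcsize("<b"))
print("bytes per value (fp32, fp16, int8):", sizes)
```

Each step down in precision halves the storage per value, at the cost of a small rounding error that is visible in the printed columns.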
TensorRT now supports models from multiple frameworks. Models trained in Caffe, TensorFlow, PyTorch, Chainer, or MXNet can be converted to TensorRT models through the Python or C++ API, and TensorFlow has integrated TensorRT into the framework itself. Caffe models can be converted directly with the TensorRT parser (model parser); models from the other frameworks must first be converted to the ONNX format and then converted with the TensorRT parser.
Five things to note about inference
- Improve throughput: throughput is the amount of output produced in a given time, measured here in inferences/second or samples/second. It is vital to the efficient operation of each machine.
- Improve computing efficiency: with the hardware specifications of each server fixed, make the software run more efficiently in order to increase throughput.
- Reduce latency: the time to run one inference is usually measured in milliseconds. Low latency not only reduces the overall prediction time but also makes the software feel smoother.
- Maintain accuracy: a trained neural network used for inference should still predict correctly when the precision is reduced or channel pruning is applied.
- Reduce GPU memory usage: a model that is too large occupies too much memory when it is deployed, so it is also very important to reduce the memory the model occupies.
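Latency and throughput from the list above can be measured with a small timing loop. The sketch below is only an illustration: `fake_model` is a hypothetical stand-in for a real inference call, and the warm-up and run counts are arbitrary choices.

```python
import time

def fake_model(batch):
    """Hypothetical stand-in for a real inference call."""
    return [sum(x * x for x in sample) for sample in batch]

def benchmark(infer, batch, warmup=3, runs=20):
    """Return (mean latency in seconds per batch, throughput in samples/second)."""
    for _ in range(warmup):          # warm-up runs are not timed
        infer(batch)
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        infer(batch)
        latencies.append(time.perf_counter() - start)
    mean_latency = sum(latencies) / len(latencies)
    throughput = len(batch) / mean_latency
    return mean_latency, throughput

batch = [[0.5] * 1024 for _ in range(8)]   # 8 samples per batch
latency_s, samples_per_s = benchmark(fake_model, batch)
print(f"mean latency: {latency_s * 1e3:.3f} ms per batch")
print(f"throughput:   {samples_per_s:.1f} samples/second")
```

Because throughput here is the batch size divided by the mean latency, a larger batch can raise throughput even when the latency of each batch gets longer.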
If any of these five requirements is not met during inference, computing efficiency or inference accuracy may suffer because of insufficient hardware capability or a fully loaded system. Effective inference tools are therefore very important for an application. TensorRT addresses all five points. Beyond the first three, it can run inference while keeping almost the same accuracy (a reduction of only about 0.2%), and it supports half-precision floating point and 8-bit integers on Quadro RTX GPUs, which can significantly reduce memory usage. The following sections explain how TensorRT optimizes a model, using GoogLeNet's Inception layer as the example.
The Inception architecture in the original GoogLeNet
Those familiar with the architecture know that GoogLeNet was designed to address two problems that traditional deep networks tend to run into as they get deeper: overfitting and increased computation. Its network-in-network concept increases the depth of the network while reducing the number of parameters. The network uses 1x1, 3x3, and 5x5 convolution kernels and 3x3 max pooling. This article does not elaborate on the GoogLeNet architecture; instead, it explains how TensorRT converts GoogLeNet.
TensorRT merges Inception
A convolution block consists of three layers: conv, bias, and ReLU. TensorRT combines the three into one layer (shown as CBR in the figure above) and executes it with a single kernel. It is like buying three items that would originally be checked out separately, and having TensorRT check them out together.
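The effect of merging conv, bias, and ReLU can be sketched in plain Python. This is not how TensorRT is implemented; it only shows, on a tiny hypothetical 1x1 convolution, that one fused pass gives the same result as three separate passes while skipping the two intermediate tensors. All function names and values are made up for illustration.

```python
def conv1x1(pixels, weights):
    """1x1 convolution: each output channel is a weighted sum of the input channels."""
    return [[sum(w * x for w, x in zip(row, px)) for row in weights] for px in pixels]

def add_bias(pixels, bias):
    """Add one bias value per output channel."""
    return [[v + b for v, b in zip(px, bias)] for px in pixels]

def relu(pixels):
    """Clamp negative values to zero."""
    return [[max(0.0, v) for v in px] for px in pixels]

def fused_cbr(pixels, weights, bias):
    """Conv + bias + ReLU in a single pass, with no intermediate tensors."""
    return [
        [max(0.0, sum(w * x for w, x in zip(row, px)) + b) for row, b in zip(weights, bias)]
        for px in pixels
    ]

pixels = [[1.0, -2.0, 0.5], [0.0, 3.0, -1.0]]    # 2 pixels, 3 input channels
weights = [[0.2, -0.1, 0.4], [-0.5, 0.3, 0.1]]   # 2 output channels
bias = [0.1, -0.2]

unfused = relu(add_bias(conv1x1(pixels, weights), bias))   # three passes
fused = fused_cbr(pixels, weights, bias)                   # one pass
print("same result:", unfused == fused)
```

On a GPU, the saving comes from launching one kernel instead of three and from not writing the intermediate results back to memory.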
TensorRT merges 1x1 CBR
TensorRT can also recognize layers that have different weights but share the same input data and filter size, such as the 1x1 convolutions. After each is combined into a 1x1 CBR, they are merged horizontally into a single, wider kernel for execution.
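Horizontal merging can be illustrated the same way. In the sketch below, two hypothetical 1x1 CBR branches read the same input; stacking their weights and biases yields one wider layer whose output can be split back into the two branch outputs. The names and numbers are assumptions for illustration, not TensorRT internals.

```python
def cbr_1x1(pixels, weights, bias):
    """Fused 1x1 conv + bias + ReLU over a list of pixels."""
    return [
        [max(0.0, sum(w * x for w, x in zip(row, px)) + b) for row, b in zip(weights, bias)]
        for px in pixels
    ]

pixels = [[1.0, -2.0, 0.5], [0.0, 3.0, -1.0]]   # shared input: 2 pixels, 3 channels

# Two branches with different weights but the same input and filter size.
weights_a, bias_a = [[0.2, -0.1, 0.4], [-0.5, 0.3, 0.1]], [0.1, -0.2]
weights_b, bias_b = [[0.3, 0.3, -0.2], [0.0, 0.1, 0.9], [-0.4, 0.2, 0.2]], [0.0, 0.05, -0.1]

# Separate execution: two layer invocations.
out_a = cbr_1x1(pixels, weights_a, bias_a)
out_b = cbr_1x1(pixels, weights_b, bias_b)

# Merged execution: one wider layer with 2 + 3 = 5 output channels.
wide = cbr_1x1(pixels, weights_a + weights_b, bias_a + bias_b)
split = len(weights_a)
merged_a = [px[:split] for px in wide]
merged_b = [px[split:] for px in wide]

print("branch A matches:", merged_a == out_a)
print("branch B matches:", merged_b == out_b)
```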
TensorRT optimizes the network architecture and removes the concat layer
TensorRT can then deliver each layer's output directly to the place where it is needed, so no concat operation is required and the concat layer can also be removed. In this process, we can see that the two boxes do not depend on each other, so they can be computed separately in two independent streams. The figure above shows the result of TensorRT's optimization of the Inception layer. This process is called layer fusion. After it, the network has far fewer layers, which also means less computation.
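One way to picture the removal of the concat layer is to let every branch write straight into its own slice of a preallocated output buffer. The sketch below assumes this mechanism purely for illustration; `make_branch` and both runner functions are hypothetical names.

```python
def make_branch(scale, width):
    """Hypothetical branch that writes `width` values into out[offset:offset + width]."""
    def branch(x, out, offset):
        for i in range(width):
            out[offset + i] = x * scale * (i + 1)
    return branch, width

def run_with_concat(branches, x):
    """Each branch fills its own buffer; a concat step then copies everything."""
    parts = []
    for branch, width in branches:
        tmp = [0.0] * width
        branch(x, tmp, 0)
        parts.append(tmp)
    return [v for part in parts for v in part]   # the extra copy done by concat

def run_without_concat(branches, x):
    """Each branch writes directly into its slice of the final buffer."""
    out = [0.0] * sum(width for _, width in branches)
    offset = 0
    for branch, width in branches:
        branch(x, out, offset)
        offset += width
    return out

branches = [make_branch(1.0, 2), make_branch(0.5, 3), make_branch(-2.0, 1)]
with_concat = run_with_concat(branches, 3.0)
without_concat = run_without_concat(branches, 3.0)
print("same result:", with_concat == without_concat)
```

Both versions produce the same values, but the second one needs no temporary buffers and no copy step.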
| Network | Layers | Layers after fusion |
|---|---|---|
| Inception V3 | 309 | 113 |
The figure below shows the performance of a model after TensorRT optimization. According to the data NVIDIA provides on GitHub, inference with TensorRT is about 1.5 to 4 times faster. The model in this comparison is BERT (Bidirectional Encoder Representations from Transformers), a language model commonly used in NLP (Natural Language Processing), converted from TensorFlow through TensorRT.