[ Teaching ] How to Accelerate AI Model Inference on GPU: A Hands-on Introduction to TensorRT, Part II

By: Leadtek AI Expert

Last time, we introduced the concept of "inference". (Article review: https://forums.leadtek.com/en/post/25)

Now we will show you how to implement Caffe model conversion through TensorRT.


Some data must be prepared before the implementation. If the model was trained in NVIDIA DIGITS, you can download all the data TensorRT requires with the Download Model function on the NVIDIA DIGITS model training page.

 NVIDIA DIGITS screen illustration

 

Unzipped files of the downloaded model


After the model file is decompressed, it contains several input files that TensorRT requires, including:

  • Network architecture file: The fixed file name is deploy.prototxt.
  • Network weight file: The name is not fixed, but the extension is always .caffemodel.
  • Image average file: The fixed file name is mean.binaryproto. If mean subtraction was performed during model training, you must supply this file to ensure that the output is correct.
  • Output node name: For classification problems this is usually softmax.
  • Image resolution: The resolution of the input images used when training the model.

After the preparation is complete, start the TensorRT environment. It is recommended to download the latest TensorRT container from the NVIDIA NGC website (https://ngc.nvidia.com/); note that this requires the Docker engine as the development platform.


Then add the above file names and settings to the main program. OUTPUT_NAME is the model's output node; because this is a classification problem, it is set to prob, the probability output. INPUT_SHAPE is the resolution of the input image as (number of channels, height, width), and DTYPE is the computing precision used by the original model.
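These settings can be collected in a small container class like the sketch below. The class name ModelData follows the article; the weight-file name and the 224×224 resolution are placeholders — substitute whatever your DIGITS download actually contains:

```python
import numpy as np

class ModelData:
    # File names follow the DIGITS download described above; the weight-file
    # name and the input resolution are placeholders -- substitute your own.
    DEPLOY_PATH = "deploy.prototxt"               # network architecture file
    MODEL_PATH = "snapshot_iter_1000.caffemodel"  # network weight file (name varies)
    MEAN_PATH = "mean.binaryproto"                # image average file
    OUTPUT_NAME = "prob"                          # softmax output of the classifier
    INPUT_SHAPE = (3, 224, 224)                   # (channels, height, width) at training time
    DTYPE = np.float32                            # computing precision of the original model
```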



The next step is to build the TensorRT engine. This includes setting the workspace size with builder.max_workspace_size = GiB(1); setting the conversion precision with builder.fp16_mode = True; parsing the model into the network (with network=network, dtype=ModelData.DTYPE); and giving the model its output name with network.mark_output(model_tensors.find(ModelData.OUTPUT_NAME)).
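Assuming the TensorRT Caffe parser Python API (as used in NVIDIA's introductory parser samples), the engine-building step can be sketched as follows; GiB is a small unit-conversion helper:

```python
def GiB(val):
    # Convert gibibytes to bytes for max_workspace_size.
    return val * (1 << 30)

def build_engine(deploy_file, model_file, output_name="prob"):
    # Sketch of the engine build, assuming the TensorRT Caffe parser API.
    import tensorrt as trt  # deferred so GiB() above works without TensorRT installed

    logger = trt.Logger(trt.Logger.WARNING)
    with trt.Builder(logger) as builder, builder.create_network() as network, \
            trt.CaffeParser() as parser:
        builder.max_workspace_size = GiB(1)   # workspace setting
        builder.fp16_mode = True              # conversion precision
        # Parse the Caffe files into the TensorRT network definition.
        model_tensors = parser.parse(deploy=deploy_file, model=model_file,
                                     network=network, dtype=trt.float32)
        # Give the model its output name.
        network.mark_output(model_tensors.find(output_name))
        return builder.build_cuda_engine(network)
```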


TensorRT then relies on pycuda for memory management and for optimizing the parallel computing workflow. It uses CUDA's page-locked (pinned) memory and CUDA streams; the purpose is to accelerate computation and reduce the latency of data transfers to and from the GPU.
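A sketch of that buffer setup, assuming pycuda and a single-input, single-output engine (treating binding 0 as the input and binding 1 as the output is an assumption):

```python
import numpy as np

def volume(shape):
    # Number of elements in a tensor of the given shape.
    return int(np.prod(shape))

def allocate_buffers(engine):
    # Sketch: page-locked host buffers, device buffers, and a CUDA stream,
    # assuming binding 0 is the input and binding 1 is the output.
    import pycuda.driver as cuda
    import pycuda.autoinit  # noqa: F401 -- creates a CUDA context

    # Page-locked (pinned) host memory makes host<->device copies faster.
    h_input = cuda.pagelocked_empty(volume(engine.get_binding_shape(0)),
                                    dtype=np.float32)
    h_output = cuda.pagelocked_empty(volume(engine.get_binding_shape(1)),
                                     dtype=np.float32)
    # Matching device-side allocations.
    d_input = cuda.mem_alloc(h_input.nbytes)
    d_output = cuda.mem_alloc(h_output.nbytes)
    # A CUDA stream orders the async copies and the kernel launch.
    stream = cuda.Stream()
    return h_input, d_input, h_output, d_output, stream
```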



Finally, set up the input-image processing: convert the image from a three-dimensional array into a one-dimensional vector (normalize_image), and copy that vector into the page-locked buffer (h_input). Before conversion, the image data are normalized by subtracting the average image file.
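A minimal sketch of that preprocessing, assuming an HWC image and an optional mean image in CHW layout:

```python
import numpy as np

def normalize_image(image, input_shape=(3, 224, 224), mean=None):
    # Reorder an HWC image to CHW, optionally subtract the mean image,
    # and flatten it into the one-dimensional vector copied into h_input.
    c, h, w = input_shape
    arr = np.asarray(image, dtype=np.float32).reshape(h, w, c)
    arr = arr.transpose(2, 0, 1)  # HWC -> CHW
    if mean is not None:
        arr = arr - np.asarray(mean, dtype=np.float32).reshape(c, h, w)
    return arr.ravel()
```

The flattened vector is then copied into the page-locked buffer, e.g. `np.copyto(h_input, normalize_image(img))`.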


The program then sends the data to the GPU and returns the prediction result from the GPU back to the CPU.
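That round trip can be sketched as follows, assuming pycuda and the buffers from the previous step; the small argmax helper turns the returned softmax vector into a class index:

```python
import numpy as np

def do_inference(context, h_input, d_input, h_output, d_output, stream):
    # Sketch: async copy in, execute, async copy out, then wait on the stream.
    import pycuda.driver as cuda
    cuda.memcpy_htod_async(d_input, h_input, stream)          # CPU -> GPU
    context.execute_async(bindings=[int(d_input), int(d_output)],
                          stream_handle=stream.handle)        # run the engine
    cuda.memcpy_dtoh_async(h_output, d_output, stream)        # GPU -> CPU
    stream.synchronize()
    return h_output

def predicted_class(h_output):
    # The index of the highest softmax probability is the prediction.
    return int(np.argmax(h_output))
```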



After saving the TensorRT engine, the setup is complete.
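Saving amounts to serializing the built engine to disk; a sketch (the file name model.trt is arbitrary):

```python
def save_engine(engine, path="model.trt"):
    # Serialize the built engine so later runs can skip the (slow) build step.
    with open(path, "wb") as f:
        f.write(engine.serialize())
```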



When you want to load the saved engine later, deserialize it with the TensorRT runtime.
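A sketch of that deserialization step, assuming the TensorRT Python runtime API:

```python
def load_engine(path="model.trt"):
    # Deserialize a previously saved engine; the runtime needs a logger.
    import tensorrt as trt
    logger = trt.Logger(trt.Logger.WARNING)
    with open(path, "rb") as f, trt.Runtime(logger) as runtime:
        return runtime.deserialize_cuda_engine(f.read())
```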



Actual benchmarks show that Caffe inference through TensorRT can be roughly 3× faster than running the same model on the GPU without TensorRT.


 

As a final example, video captured by a WebCam can be imported into OpenCV and then inferred through TensorRT.
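Such a capture loop might look like the sketch below; `infer` is a hypothetical callback standing in for the TensorRT round trip described above (copy the vector to the GPU, execute, copy the probabilities back):

```python
import numpy as np

def run_webcam(infer, input_hw=(224, 224)):
    # Capture frames from the default WebCam with OpenCV, resize them to the
    # training resolution, flatten to a CHW vector, and classify each frame.
    # `infer` is a hypothetical callback mapping that vector to class probabilities.
    import cv2  # deferred so the sketch imports without OpenCV installed

    cap = cv2.VideoCapture(0)  # device 0 = default WebCam
    try:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            # cv2.resize takes (width, height), hence the reversed tuple.
            resized = cv2.resize(frame, input_hw[::-1]).astype(np.float32)
            chw = resized.transpose(2, 0, 1).ravel()  # HWC -> CHW, flattened
            probs = infer(chw)
            print("predicted class:", int(np.argmax(probs)))
            if cv2.waitKey(1) & 0xFF == ord("q"):  # press q to quit
                break
    finally:
        cap.release()
```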

