[ Evaluation ] RTX AI Workstation Performance in DL Applications

By: Leadtek AI Expert

Data is fundamentally changing the way companies do business, driving demand for data scientists and increasing the complexity of their workflows. Leadtek introduces a purpose-built workstation that helps data scientists turn massive amounts of information into insights faster than ever before by accelerating data preparation, model training, and visualization.


The WinFast RTX AI Workstation is a turnkey system that combines the power of the world’s most advanced Quadro GPUs with accelerated CUDA-X AI data science software, delivering a new breed of fully integrated workstations that ensure maximum compatibility and reliability for data science.


In the following, we show the performance results of two Leadtek RTX AI Workstations: the mid-range WinFast WS830 and the high-end WinFast WS1030.


All tested workstations use the TensorFlow 1.12 deep learning framework to run image-classification training on the ImageNet dataset.

The first purpose is to demonstrate AI training performance (images/sec) with a single GPU versus multiple GPUs. A higher images/sec score means better performance.

The second purpose is to compare half-precision (FP16) and single-precision (FP32) floating-point performance in multi-GPU training. The GPU activates its Tensor Cores for FP16 calculations, while CUDA cores handle FP32 calculations, so this test measures the Tensor Core performance of RTX GPUs during AI model training.
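To make the metric concrete, here is a minimal sketch of how the images/sec figure and the relative speedups quoted throughout this article can be computed. The function names and sample numbers are ours, chosen for illustration; they are not from the benchmark harness.

```python
# Hypothetical helpers for the throughput metric used in this evaluation.

def images_per_sec(num_images: int, elapsed_s: float) -> float:
    """Throughput: total images processed divided by wall-clock seconds."""
    return num_images / elapsed_s

def speedup_percent(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`, in percent."""
    return (improved / baseline - 1.0) * 100.0

# Example: 1000 steps of 64-image batches finishing in 40 s.
throughput = images_per_sec(64 * 1000, 40.0)   # 1600.0 images/sec
gain = speedup_percent(800.0, 1600.0)          # 100.0 (% faster, i.e. 2x)
```

A "multi-GPU performance increased 85%" claim later in the article is exactly `speedup_percent(single_gpu_ips, multi_gpu_ips)` on the two measured throughputs.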


Platform Specification

Workstation   | WinFast WS830        | WinFast WS1030
CPU           | Intel Xeon W-2123 x1 | Intel Xeon Gold 5122 x2
RAM           | 2666MHz 32GB x4      | 2666MHz 32GB x6
OS            | Ubuntu 18.04 LTS     | Ubuntu 18.04 LTS
Driver        | 410.78               | 410.78
Docker        | 18.09                | 18.09
nvidia-docker | 2.0                  | 2.0
Framework     | TensorFlow 1.12      | TensorFlow 1.12

Test List

WinFast WS830 (Quadro RTX5000 / RTX6000)

- Multi-GPU Performance: compare AI training performance between a single GPU and multiple GPUs.

- FP32 & FP16 Precision Performance: compare AI training performance between single-precision (FP32) and half-precision (FP16) floating point.

WinFast WS1030 (Quadro RTX6000 / RTX8000)

- Multi-GPU Performance: compare AI training performance between a single GPU and multiple GPUs.

- FP32 & FP16 Precision Performance: compare AI training performance between single-precision (FP32) and half-precision (FP16) floating point.
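The article does not say which benchmark harness produced these numbers. A common choice for TensorFlow 1.x ImageNet throughput tests is the `tf_cnn_benchmarks.py` script from the tensorflow/benchmarks repository; assuming that harness, a run for each test-list entry could be assembled like this (the batch size is our illustrative choice):

```python
# Sketch: build an argv list for TensorFlow's tf_cnn_benchmarks script.
# Assumption: the tests used tf_cnn_benchmarks; the article does not confirm this.

def build_benchmark_cmd(model: str, num_gpus: int, use_fp16: bool,
                        batch_size: int = 64) -> list[str]:
    """Return the command-line argument list for one benchmark run."""
    cmd = [
        "python", "tf_cnn_benchmarks.py",
        f"--model={model}",
        f"--num_gpus={num_gpus}",
        f"--batch_size={batch_size}",
    ]
    if use_fp16:
        cmd.append("--use_fp16")  # FP16 math path, exercising Tensor Cores
    return cmd

# Example: 2-GPU ResNet-50 run in half precision.
print(" ".join(build_benchmark_cmd("resnet50", 2, True)))
```

Sweeping `model` over alexnet, googlenet, vgg16, inception4, and resnet50 with `num_gpus` in {1, 2} and `use_fp16` in {False, True} reproduces the test matrix above.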



WinFast WS830 Test Result

Multi-GPU Performance 

With its 900-watt power supply, the WinFast WS830 supports up to two Quadro RTX5000 (or higher) professional graphics cards. From the results shown below, TensorFlow performance over NVLink stands out only in AlexNet, not in the other AI models. The other model templates may not be optimized for multi-GPU data exchange, which does not mean those models cannot benefit from NVLink's high-speed data throughput. Taking AlexNet as an example, multi-GPU performance increased 10%~30% with the NVLink interconnect. Compared to a single GPU, multi-GPU performance increased 65%~110%, and most models improved by over 85%, meaning two GPUs deliver close to 2x the performance of a single GPU.
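The "close to 2x" observation can be restated as scaling efficiency, the fraction of ideal linear scaling actually achieved. A small sketch, using illustrative throughput numbers rather than figures from the charts:

```python
# Hypothetical scaling-efficiency helper; sample numbers are illustrative.

def scaling_efficiency(single_gpu_ips: float, multi_gpu_ips: float,
                       num_gpus: int = 2) -> float:
    """Fraction of ideal linear scaling achieved (1.0 = perfect)."""
    return multi_gpu_ips / (single_gpu_ips * num_gpus)

# An 85% gain over one GPU corresponds to 92.5% of the ideal 2x.
eff = scaling_efficiency(1000.0, 1850.0)  # 0.925
```

By this measure, the "over 85% increase" seen on most models corresponds to better than 92% scaling efficiency on two GPUs.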


(FP32) Single-precision Performance of QUADRO RTX5000 with WinFast WS830

 

(FP16) Half-precision Performance of QUADRO RTX5000 with WinFast WS830

 

(FP32) Single-precision Performance of QUADRO RTX6000 with WinFast WS830 


(FP16) Half-precision Performance of QUADRO RTX6000 with WinFast WS830


FP32 & FP16 Performance

In this test, we use two GPUs with the NVLink interconnect to compare single-precision (FP32) and half-precision (FP16) computing in AI model training. FP16 computing leverages the latest Tensor Cores, while FP32 uses only CUDA cores, so we can compare Tensor Core performance across different training models. From the results shown below, FP16 performance increased 60%~95% on models such as VGG16, Inception V4, and ResNet50, which have many millions of parameters and many hidden layers. On AlexNet and GoogLeNet, which have fewer parameters and hidden layers, FP16 performance increased 25%~55% compared to FP32 computing.
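One reason parameter-heavy models gain more from FP16 is simple arithmetic: halving the bytes per value halves the memory traffic for weights and activations, and large models are more often bound by that traffic. A sketch using VGG16's commonly cited parameter count (roughly 138 million; treat it as approximate, not a figure from this article):

```python
# Back-of-envelope weight-memory comparison, FP32 vs FP16.
BYTES_FP32 = 4
BYTES_FP16 = 2

def param_memory_mb(num_params: int, bytes_per_param: int) -> float:
    """Memory footprint of the weights alone, in mebibytes."""
    return num_params * bytes_per_param / (1024 ** 2)

vgg16_params = 138_000_000                            # ~138M, approximate
fp32_mb = param_memory_mb(vgg16_params, BYTES_FP32)   # ~526 MB
fp16_mb = param_memory_mb(vgg16_params, BYTES_FP16)   # ~263 MB
```

The 2x reduction applies to every tensor moved through memory, which compounds with the Tensor Cores' higher FP16 arithmetic throughput on the larger models.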

 

FP32 v.s. FP16 Performance of QUADRO RTX5000 with WinFast WS830

 

FP32 v.s. FP16 Performance of QUADRO RTX6000 with WinFast WS830



WinFast WS1030 Test Result

Multi-GPU Performance

From the results shown below, TensorFlow performance over NVLink stands out only in AlexNet, not in the other AI models. The other model templates may not be optimized for multi-GPU data exchange, which does not mean those models cannot benefit from NVLink's high-speed data throughput. Taking AlexNet as an example, multi-GPU performance increased 10%~35% with the NVLink interconnect. Compared to a single GPU, multi-GPU performance increased 65%~120%, and most models improved by over 85%, meaning two GPUs deliver close to 2x the performance of a single GPU.

 

(FP32) Single-precision Performance of QUADRO RTX6000 with WinFast WS1030

 

(FP16) Half-precision Performance of QUADRO RTX6000 with WinFast WS1030 


(FP32) Single-precision Performance of QUADRO RTX8000 with WinFast WS1030

 

(FP16) Half-precision Performance of QUADRO RTX8000 with WinFast WS1030


FP32 & FP16 Performance

In this test, we use two GPUs with the NVLink interconnect to compare single-precision (FP32) and half-precision (FP16) computing in AI model training. FP16 computing leverages the latest Tensor Cores, while FP32 uses only CUDA cores, so we can compare Tensor Core performance across different training models. From the results shown below, FP16 performance increased 78%~95% on models such as VGG16, Inception V4, and ResNet50, which have many millions of parameters and many hidden layers. On AlexNet and GoogLeNet, which have fewer parameters and hidden layers, FP16 performance increased 24%~45% compared to FP32 computing.


FP32 v.s. FP16 Performance of QUADRO RTX6000 with WinFast WS1030


FP32 v.s. FP16 Performance of QUADRO RTX8000 with WinFast WS1030
