RTX AI Workstation Performance in Deep Learning Applications
Data is fundamentally changing the way companies do business, driving demand for data scientists and increasing the complexity of their workflows. Leadtek introduces a purpose-built workstation that helps data scientists transform massive amounts of information into insights faster than ever before by accelerating data preparation, model training and visualization.
The WinFast RTX AI Workstation is a turnkey system that combines the power of the most advanced Quadro GPUs with CUDA-X AI accelerated data science software, delivering a fully integrated workstation with maximum compatibility and reliability for data science.
Below, we present performance results for two Leadtek RTX AI Workstations: the mid-range WinFast WS830 and the high-end WinFast WS1030.
All tested workstations use the TensorFlow 1.12 deep learning framework to run image-classification training on the ImageNet dataset.
The first goal is to measure AI training throughput (images/sec) in single-GPU and multi-GPU configurations; a higher images/sec score means better performance.
The second goal is to compare half-precision (FP16) and single-precision (FP32) floating-point performance in the multi-GPU configuration. The GPU engages its Tensor Cores for FP16 calculations and its CUDA Cores for FP32 calculations, so this test shows how much the RTX GPUs' Tensor Cores contribute during AI model training.
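The report does not include the benchmark script itself; as a rough illustration of how an images/sec figure can be measured in TensorFlow 1.x, the sketch below times training steps on synthetic data. The tiny stand-in network, batch size, and step counts are placeholder choices, not the settings used in these tests.

```python
import time
import tensorflow as tf  # TensorFlow 1.12-style API

BATCH = 64    # placeholder batch size
STEPS = 100   # timed steps
WARMUP = 10   # untimed warm-up steps

# Synthetic ImageNet-sized inputs; a real benchmark would read the ImageNet dataset.
images = tf.random_normal([BATCH, 224, 224, 3])
labels = tf.random_uniform([BATCH], maxval=1000, dtype=tf.int32)

# Small stand-in network; the actual tests use AlexNet, GoogLeNet, VGG16,
# Inception V4, and ResNet50.
net = tf.layers.conv2d(images, 64, 7, strides=2, activation=tf.nn.relu)
net = tf.layers.max_pooling2d(net, 3, 2)
net = tf.layers.flatten(net)
logits = tf.layers.dense(net, 1000)

loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(WARMUP):
        sess.run(train_op)           # let cuDNN autotuning and memory allocation settle
    start = time.time()
    for _ in range(STEPS):
        sess.run(train_op)
    elapsed = time.time() - start
    print("images/sec: %.1f" % (BATCH * STEPS / elapsed))
```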
Platform Specification
Workstation | WinFast WS830 | WinFast WS1030 |
---|---|---|
CPU | Intel Xeon W-2123 *1 | Intel Xeon Gold 5122 *2 |
RAM | 2666MHz 32GB *4 | 2666MHz 32GB *6 |
OS | Ubuntu 18.04 LTS | Ubuntu 18.04 LTS |
Driver | 410.78 | 410.78 |
Docker | 18.09 | 18.09 |
nvidia-docker | 2.0 | 2.0 |
Framework | TensorFlow 1.12 | TensorFlow 1.12 |
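As a quick sanity check that the driver and both GPUs are visible to TensorFlow inside the nvidia-docker container, the local devices can be listed as below. This is not part of the published results, just a suggested verification step.

```python
from tensorflow.python.client import device_lib

# Lists the CPU and GPU devices TensorFlow can see; each GPU entry includes
# the board name (e.g. a Quadro RTX card) and the memory available to it.
for dev in device_lib.list_local_devices():
    print(dev.name, dev.physical_device_desc)
```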
Test List
Workstation | Test Item | Test Target |
---|---|---|
WinFast WS830 | Multi-GPU Performance | Compare AI training performance between single GPU and multi-GPU: RTX5000, RTX6000 |
WinFast WS830 | FP32 & FP16 Precision Performance | Compare AI training performance between single-precision (FP32) and half-precision (FP16) floating point: RTX5000, RTX6000 |
WinFast WS1030 | Multi-GPU Performance | Compare AI training performance between single GPU and multi-GPU: RTX6000, RTX8000 |
WinFast WS1030 | FP32 & FP16 Precision Performance | Compare AI training performance between single-precision (FP32) and half-precision (FP16) floating point: RTX6000, RTX8000 |
WinFast WS830 Test Results
Multi-GPU Performance
With a 900 W power supply, the WinFast WS830 supports up to two Quadro RTX5000 (or higher) professional graphics cards. The results below show that NVLink gives TensorFlow a clear advantage only with AlexNet; the other model templates may not be optimized for multi-GPU data exchange, which does not mean those models cannot benefit from NVLink's high-speed data throughput. Taking AlexNet as an example, multi-GPU performance improved by 10%~30% with the NVLink interconnect. Compared to a single GPU, multi-GPU performance increased by 65%~110%, and most models gained more than 85%, meaning two GPUs deliver close to twice the throughput of a single GPU.
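The report does not state which multi-GPU strategy the benchmark uses. A common TensorFlow 1.x pattern for two-GPU data parallelism is to build one model replica ("tower") per GPU and average the gradients before a single update, as in the hypothetical sketch below; the build_model helper and synthetic inputs are placeholders for the tested networks and the ImageNet input pipeline.

```python
import tensorflow as tf

NUM_GPUS = 2
BATCH_PER_GPU = 64   # placeholder per-GPU batch size

def build_model(images, labels):
    """Placeholder network standing in for AlexNet / ResNet50 / etc.; returns a scalar loss."""
    net = tf.layers.flatten(images)
    logits = tf.layers.dense(net, 1000)
    return tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

optimizer = tf.train.MomentumOptimizer(0.01, 0.9)
tower_grads = []

# One replica per GPU; weights are shared through variable reuse.
for i in range(NUM_GPUS):
    with tf.device('/gpu:%d' % i), tf.variable_scope('model', reuse=(i > 0)):
        images = tf.random_normal([BATCH_PER_GPU, 224, 224, 3])   # stands in for an ImageNet pipeline
        labels = tf.random_uniform([BATCH_PER_GPU], maxval=1000, dtype=tf.int32)
        tower_grads.append(optimizer.compute_gradients(build_model(images, labels)))

# Average each variable's gradient over the towers, then apply one update.
averaged = []
for grads_and_vars in zip(*tower_grads):
    grads = tf.stack([g for g, _ in grads_and_vars])
    averaged.append((tf.reduce_mean(grads, axis=0), grads_and_vars[0][1]))
train_op = optimizer.apply_gradients(averaged)

# Running this graph would need a Session created with allow_soft_placement=True
# so that any op without a GPU kernel can fall back to the CPU.
```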
(FP32) Single-precision Performance of QUADRO RTX5000 with WinFast WS830
(FP16) Half-precision Performance of QUADRO RTX5000 with WinFast WS830
(FP32) Single-precision Performance of QUADRO RTX6000 with WinFast WS830
(FP16) Half-precision Performance of QUADRO RTX6000 with WinFast WS830
FP32 & FP16 Performance
In this test, we use two GPUs with the NVLink interconnect to compare single-precision (FP32) and half-precision (FP16) computing in AI model training. Because FP16 computing leverages the latest Tensor Cores while FP32 uses only CUDA Cores, the comparison shows the Tensor Core contribution across different training models. From the results below, FP16 performance increased by 60%~95% for models with millions of parameters and many hidden layers, such as VGG16, Inception V4, and ResNet50. For AlexNet and GoogLeNet, which have fewer parameters and hidden layers, FP16 performance increased by 25%~55% compared to FP32.
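The report does not detail how FP16 mode was enabled. One common TensorFlow 1.x approach, sketched below as an assumption rather than the actual test setup, keeps the master weights in FP32, casts activations and layer compute to FP16 so convolutions and matrix multiplies can run on the Tensor Cores, and scales the loss to keep small gradients from underflowing in half precision.

```python
import tensorflow as tf

LOSS_SCALE = 128.0   # static loss-scaling factor; a placeholder choice

def fp32_storage_getter(getter, name, shape=None, dtype=None, *args, **kwargs):
    """Store the master copy of each variable in FP32, but hand the layers an
    FP16 cast so their math runs in half precision on the Tensor Cores."""
    var = getter(name, shape, tf.float32, *args, **kwargs)
    return tf.cast(var, dtype) if dtype == tf.float16 else var

images = tf.random_normal([64, 224, 224, 3])   # synthetic stand-in for ImageNet input
labels = tf.random_uniform([64], maxval=1000, dtype=tf.int32)

with tf.variable_scope('model', custom_getter=fp32_storage_getter):
    x = tf.cast(images, tf.float16)                  # FP16 activations
    x = tf.layers.conv2d(x, 64, 7, strides=2, activation=tf.nn.relu)
    x = tf.layers.flatten(x)
    logits = tf.cast(tf.layers.dense(x, 1000), tf.float32)   # compute the loss in FP32

loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
optimizer = tf.train.MomentumOptimizer(0.01, 0.9)

# Scale the loss before differentiation so small FP16 gradients do not underflow,
# then unscale the gradients before the FP32 weight update.
grads_and_vars = optimizer.compute_gradients(loss * LOSS_SCALE)
unscaled = [(g / LOSS_SCALE, v) for g, v in grads_and_vars if g is not None]
train_op = optimizer.apply_gradients(unscaled)
```

A static loss scale is the simplest option; dynamic loss scaling, which raises or lowers the factor automatically when overflows are detected, is a common refinement.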
FP32 vs. FP16 Performance of QUADRO RTX5000 with WinFast WS830
FP32 vs. FP16 Performance of QUADRO RTX6000 with WinFast WS830
WinFast WS1030 Test Results
Multi-GPU Performance
From the results below, NVLink again gives TensorFlow a clear advantage only with AlexNet; the other model templates may not be optimized for multi-GPU data exchange, which does not mean those models cannot benefit from NVLink's high-speed data throughput. Taking AlexNet as an example, multi-GPU performance improved by 10%~35% with the NVLink interconnect. Compared to a single GPU, multi-GPU performance increased by 65%~120%, and most models gained more than 85%, meaning two GPUs deliver close to twice the throughput of a single GPU.
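To confirm that the two GPUs are actually connected over NVLink rather than only PCIe, nvidia-smi can report the interconnect topology and per-link status. The snippet below simply shells out to those commands; it is a convenience sketch, not part of the benchmark itself.

```python
import subprocess

# GPU interconnect topology: NVLink-connected pairs appear as NV# entries in the
# matrix, while PCIe-only paths appear as PIX/PHB/SYS.
print(subprocess.check_output(["nvidia-smi", "topo", "-m"]).decode())

# Per-link NVLink state and speed for each GPU.
print(subprocess.check_output(["nvidia-smi", "nvlink", "--status"]).decode())
```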
(FP32) Single-precision Performance of QUADRO RTX6000 with WinFast WS1030
(FP16) Half-precision Performance of QUADRO RTX6000 with WinFast WS1030
(FP32) Single-precision Performance of QUADRO RTX8000 with WinFast WS1030
(FP16) Half-precision Performance of QUADRO RTX8000 with WinFast WS1030
FP32 & FP16 Performance
In this test, we use two GPUs with the NVLink interconnect to compare single-precision (FP32) and half-precision (FP16) computing in AI model training. Because FP16 computing leverages the latest Tensor Cores while FP32 uses only CUDA Cores, the comparison shows the Tensor Core contribution across different training models. From the results below, FP16 performance increased by 78%~95% for models with millions of parameters and many hidden layers, such as VGG16, Inception V4, and ResNet50. For AlexNet and GoogLeNet, which have fewer parameters and hidden layers, FP16 performance increased by 24%~45% compared to FP32.
FP32 vs. FP16 Performance of QUADRO RTX6000 with WinFast WS1030
FP32 vs. FP16 Performance of QUADRO RTX8000 with WinFast WS1030