
Review! RTX A6000 Can Do More Than You Imagine!

By: Leadtek AI Expert

Leadtek launches the NVIDIA RTX A6000 Ampere professional graphics card.

Unlock the next generation of scientific breakthroughs.

Let's see what the RTX A6000 can bring to your AI applications.





Software and Hardware Configurations

Item | Specifications
Barebone | WinFast GS4845
CPU | AMD EPYC 7262 ×2 (8 cores, 3.2 GHz per CPU)
RAM | Samsung 3200 MHz 32 GB ×32 (1 TB)
OS | Ubuntu 18.04.2 LTS
Driver | 460.27.04
Docker | 19.03.13
nvidia-docker | 2.5.0
Framework | NGC TensorFlow 20.12-tf1-py3 (classification model); NGC PyTorch 20.06 (detection model); NGC TensorFlow 20.12-tf2-py3 (segmentation model); NGC PyTorch 20.12 (translation model)
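
For reference, below is a minimal sketch of how one of the NGC containers listed above can be launched. The image tag matches the table, but the mount path and other options are illustrative assumptions, not the exact commands used in this benchmark.

```python
import subprocess

# Launch the NGC TensorFlow 20.12-tf1-py3 container from the configuration table.
# Docker 19.03+ exposes GPUs via the --gpus flag (nvidia-container-toolkit).
# The dataset path below is a placeholder, not the path used in the original tests.
subprocess.run([
    "docker", "run", "--gpus", "all", "--rm", "-it",
    "--shm-size=8g",                              # extra shared memory for data loaders
    "-v", "/path/to/dataset:/workspace/data",     # placeholder host dataset path
    "nvcr.io/nvidia/tensorflow:20.12-tf1-py3",
], check=True)
```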


Test Item

Item | Description
NVIDIA Ampere GPU performance | Test the performance of the RTX A6000 in image classification model training
Comparison of Turing GPU and Ampere GPU | Compare the training performance of a single RTX A6000 with two RTX 6000 cards (using NVLink) on each model
Model training comparison | Compare the training performance of the RTX A6000, RTX 6000 and RTX 8000 on each model
Model inference comparison | Compare the inference performance of the RTX A6000, RTX 6000 and RTX 8000 on various models


Test Introduction

The following describes the performance evaluation for each test item. Each value is the average of at least 100 iterations, measured while the system was not running other tasks. Owing to differences in deep learning model architectures and momentary system conditions, the presented data still contains some measurement error, so the figures in this report should be taken only as a reference for the relative performance of the various GPUs. The vertical axis of each graph represents deep learning throughput: the number of images that each model (horizontal axis) can process per second, or, for the translation model, the number of tokens processed per second.
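
To make the measurement concrete, here is a minimal sketch of how an images-per-second figure can be averaged over a fixed number of iterations, using PyTorch (one of the frameworks in the configuration table) with synthetic data; the function and sizes are illustrative, not the original benchmark script.

```python
import time
import torch
import torchvision

def measure_throughput(model, batch_size=64, iters=100, warmup=10):
    """Average images/sec over `iters` training iterations (illustrative only)."""
    device = torch.device("cuda")
    model = model.to(device).train()
    images = torch.randn(batch_size, 3, 224, 224, device=device)
    labels = torch.randint(0, 1000, (batch_size,), device=device)
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    def step():
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()

    for _ in range(warmup):      # let cuDNN choose algorithms before timing
        step()
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        step()
    torch.cuda.synchronize()     # wait for all queued GPU work before stopping the clock
    return batch_size * iters / (time.time() - start)

if __name__ == "__main__":
    print(f"{measure_throughput(torchvision.models.resnet50()):.1f} images/sec")
```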


NVIDIA Ampere GPU Performance

The deep learning image classification tests show that NVLink brings no obvious benefit to the NVIDIA RTX A6000. Figure 1 summarizes the overall performance of the Ampere-architecture NVIDIA RTX A6000. The Ampere architecture supports the latest deep learning library, cuDNN 8, and is equipped with third-generation Tensor Cores, which are increasingly well integrated into deep learning frameworks. In addition to a major improvement in mixed-precision (FP16) performance, single-precision operations can use TF32 by default, which performs slightly better than traditional FP32. Half-precision performance is about 40% to 155% higher than single precision, with the gain varying by model architecture.
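
As an illustration of where TF32 comes from (shown in PyTorch; the flags are standard, the sizes are placeholders): on Ampere GPUs, FP32 matrix math is routed to TF32 Tensor Cores by default, and the behaviour can be toggled explicitly.

```python
import torch

# On Ampere GPUs, FP32 matmuls and convolutions use TF32 Tensor Cores by default;
# setting these flags to False forces classic FP32 instead.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")
c = a @ b          # runs on Tensor Cores in TF32 while the tensors stay FP32
print(c.dtype)     # torch.float32: TF32 changes the math mode, not the data type
```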


Figure 1 Performance summary of NVIDIA RTX A6000 deep learning image classification


Figure 1 can be subdivided into a single-GPU vs. multi-GPU comparison and a comparison of the multi-GPU performance gain from NVLink; see Figure 2 for the former and Figure 3 for the latter.


With two GPUs, image classification performance increases by 86% to 97% over a single GPU. Beyond image classification, the object detection model (SSD) improves by about 95% to 100%, the image segmentation model (Mask R-CNN) by about 85% to 92%, and the translation model (GNMT) by about 72% to 81% (see Table 1).
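
For reference, a minimal sketch of the kind of two-GPU data-parallel training behind these numbers, using PyTorch DistributedDataParallel with synthetic data; this is an illustrative setup, not the NGC benchmark scripts themselves.

```python
import os
import torch
import torch.distributed as dist
import torchvision
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK/LOCAL_RANK/WORLD_SIZE; one process drives each GPU.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torchvision.models.resnet50().cuda()
    model = DDP(model, device_ids=[local_rank])   # gradients sync over NCCL (NVLink or PCIe)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    criterion = torch.nn.CrossEntropyLoss()
    images = torch.randn(64, 3, 224, 224, device="cuda")
    labels = torch.randint(0, 1000, (64,), device="cuda")

    for _ in range(100):
        optimizer.zero_grad()
        criterion(model(images), labels).backward()
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()   # launch with: torchrun --nproc_per_node=2 train_ddp.py
```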


Table 1 Performance improvement percentage of two Ampere architecture GPUs compared with a single GPU on various models

GPU | Image classification | Object detection | Image segmentation | Translation
RTX A6000 | 86%~97% | 95%~100% | 85%~92% | 72%~81%



Figure 2 Performance comparison of NVIDIA RTX A6000 with multi-GPU on deep learning image classification 


As for the influence of NVLink, the performance of the NVIDIA RTX A6000 increases by only about 1% to 5% after the NVLink bridge is installed; the overall improvement is not significant.


This article also compares other types of deep learning models. As shown in Figure 4, NVLink does not significantly improve the performance of the object detection model (SSD), and the performance of the image segmentation model (Mask R-CNN) is improved by about 2% to 7%. The translation model has a relatively greater performance improvement, about 12% to 15% (see Table 2).
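
A quick way to confirm the GPU-to-GPU path on such a system is to check peer-to-peer access from PyTorch (with an NVLink bridge installed, these transfers run over NVLink rather than PCIe); `nvidia-smi nvlink --status` reports the physical link state directly. This snippet is illustrative and independent of the benchmark itself.

```python
import torch

# List every GPU pair and whether direct peer-to-peer access is available;
# with the NVLink bridge installed, P2P traffic uses NVLink instead of PCIe.
n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"peer access GPU {i} -> GPU {j}: {ok}")
```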


Table 2 Performance improvement of Ampere architecture GPU with NVLink on various models

GPU | Image classification | Object detection | Image segmentation | Translation
RTX A6000 | 1%~5% | 0% | 2%~7% | 12%~15%


Figure 3 Performance comparison of NVIDIA RTX A6000 deep learning image classification with NVLink 


 

Figure 4 NVLink performance comparison of the Ampere GPU on other deep learning models



Comparison between Turing GPU and Ampere GPU

The Turing-architecture Quadro RTX 6000 has 24 GB of GDDR6 memory, while the Ampere-architecture NVIDIA RTX A6000 has 48 GB (see Table 3). If two RTX 6000 cards are used for distributed deep learning training, how do they fare against a single RTX A6000? Figure 5 shows the results for the image classification (VGG16 and ResNet50), object detection (SSD), image segmentation (Mask R-CNN), and translation (GNMT) models. As Table 4 shows, except for the translation model, where a single NVIDIA RTX A6000 is 6% to 57% faster than two Quadro RTX 6000s, the two Quadro RTX 6000s are about 10% to 20% faster on all the other image-recognition models.
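
The specifications in Table 3 can also be cross-checked programmatically; a small sketch using standard PyTorch device queries (the output format is illustrative):

```python
import torch

# Print name, memory size, SM count and compute capability for each visible GPU,
# to cross-check the specifications listed in Table 3.
for i in range(torch.cuda.device_count()):
    p = torch.cuda.get_device_properties(i)
    print(f"{p.name}: {p.total_memory / 2**30:.0f} GiB, "
          f"{p.multi_processor_count} SMs, compute capability {p.major}.{p.minor}")
```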


Table 3 Comparison of NVIDIA RTX A6000 and Quadro RTX 6000 specifications

Spec | NVIDIA RTX A6000 | Quadro RTX 6000
CUDA Cores | 10752 | 4608
GPU Memory | 48 GB GDDR6 | 24 GB GDDR6
FP32 Performance | 38.7 TFLOPS | 16.3 TFLOPS
TF32 Performance | 77.4 Tensor TFLOPS | NA
Deep Learning Performance | 154.9 Tensor TFLOPS | 130.5 Tensor TFLOPS


Table 4 Performance improvement of a single NVIDIA RTX A6000 over two Quadro RTX 6000s in deep learning

RTX A6000 | Image classification | Object detection | Image segmentation | Translation
Performance improvement % | -20%~-10% | -20%~-13% | -21%~-15% | 6%~57%


Figure 5 Deep learning performance comparison between NVIDIA RTX A6000 and Quadro RTX 6000



Model Training Comparison

Comparing the single-GPU training performance of the two generations of high-end GPUs (Figure 6), the Ampere-architecture GPU is about 36% to 130% faster than the Turing-based GPU in single precision (Ampere TF32 vs. Turing FP32; see Table 5), and 1% to 46% faster in half precision.
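
For context, the FP16 numbers come from mixed-precision training; below is a minimal PyTorch AMP sketch with a placeholder model and data (not the benchmarked models themselves):

```python
import torch
import torchvision

model = torchvision.models.resnet50().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = torch.nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()          # scales the loss to avoid FP16 underflow

images = torch.randn(64, 3, 224, 224, device="cuda")
labels = torch.randint(0, 1000, (64,), device="cuda")

for _ in range(10):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():           # forward pass runs in FP16 where safe
        loss = criterion(model(images), labels)
    scaler.scale(loss).backward()             # backward pass on the scaled loss
    scaler.step(optimizer)                    # unscales gradients, then steps
    scaler.update()
```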


Figure 6 Training performance comparison of the two generations of GPUs in various models



Table 5 The average performance improvement percentage of Ampere architecture GPU over Turing-based GPU

RTX A6000 | Image classification (VGG16) | Object detection (SSD) | Image segmentation (Mask R-CNN) | Translation (GNMT)
TF32/FP32 | 55% | 36% | 45% | 130%
FP16 | 35% | 46% | 13% | 27%



Model Inference Comparison

Summarizing the single-GPU inference performance of the Turing-architecture and Ampere-architecture high-end GPUs (Figure 7), the Ampere-architecture GPU is about 32% to 125% faster than the Turing-based GPU in single precision (Ampere TF32 vs. Turing FP32; see Table 6), and 18% to 33% faster in half precision.
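
For half-precision inference specifically, a minimal sketch with PyTorch autocast and a placeholder classification model (illustrative only, not the benchmarked setup):

```python
import torch
import torchvision

model = torchvision.models.resnet50().cuda().eval()
images = torch.randn(32, 3, 224, 224, device="cuda")

# FP16 inference via autocast; no gradients are needed, so wrap in no_grad().
with torch.no_grad(), torch.cuda.amp.autocast():
    logits = model(images)
print(logits.shape, logits.dtype)   # torch.Size([32, 1000]) torch.float16
```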

 

Figure 7 Inference performance comparison of the two generations of GPUs in various models



Table 6 The average performance improvement percentage of Ampere architecture GPU over Turing-based GPU

RTX A6000 | Object detection (SSD) | Image segmentation (Mask R-CNN) | Translation (GNMT)
TF32/FP32 | 32% | 40% | 125%
FP16 | 33% | 18% | 24%






