Review: RTX A6000 Can Do More Than You Imagine
Leadtek launches the NVIDIA RTX A6000, an Ampere-architecture professional graphics card, to unlock the next generation of scientific breakthroughs.
Let's see what the RTX A6000 can bring to you in AI applications.
Software and Hardware Configurations
AMD EPYC 7262*2 (8 Cores, 3.2GHz per CPU)
Samsung 3200MHz 32GB*32 (1TB)
Ubuntu 18.04.2 LTS
NGC TensorFlow 20.12-tf1-py3 for classification model
NGC PyTorch 20.06 for detection model
NGC TensorFlow 20.12-tf2-py3 for segmentation model
NGC PyTorch 20.12 for translation model
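The NGC containers listed above are pulled and launched with Docker. A typical invocation, assuming the NVIDIA Container Toolkit is installed on the host (the exact launch flags used for this report are not published, so this is a representative sketch):

```shell
# Pull and start the TensorFlow 20.12 (TF1) container used for the
# classification benchmarks; --gpus all exposes every installed GPU.
docker pull nvcr.io/nvidia/tensorflow:20.12-tf1-py3
docker run --gpus all -it --rm nvcr.io/nvidia/tensorflow:20.12-tf1-py3

# The detection benchmarks use the PyTorch 20.06 container:
docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:20.06-py3
```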
NVIDIA Ampere GPU performance
Test the performance of RTX A6000 in image classification model training
Comparison of Turing GPU and Ampere GPU
Compare the training performance of a single RTX A6000 against two Quadro RTX 6000s (with NVLink) on each model
Model training comparison
Compare the training performance difference of RTX A6000, RTX 6000 and RTX 8000 in each model
Model inference comparison
Compare the performance difference of RTX A6000, RTX 6000 and RTX 8000 in various model inference
The following is a description of each performance test item. Each value is the average of at least 100 iterations, measured while the system was performing no other tasks. Owing to differences in deep learning model architectures and momentary system operating conditions, the data still carry some measurement variance; the figures in this report should therefore be read as a basis for relative performance comparison between GPUs rather than as absolute numbers. In each chart, performance is measured as the number of images each model can process per second (or, for the translation model, the number of tokens processed per second).
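The images-per-second metric above is just batch size times iterations divided by wall-clock time, averaged over many steps. A minimal sketch of such a measurement harness (the report's actual harness is not published; `step_fn` and `dummy_step` here are hypothetical stand-ins):

```python
import time

def images_per_second(step_fn, batch_size, iterations=100, warmup=10):
    """Average throughput in images/s over `iterations` timed steps,
    after `warmup` untimed steps (this report averages >= 100 iterations)."""
    for _ in range(warmup):
        step_fn()
    start = time.perf_counter()
    for _ in range(iterations):
        step_fn()
    elapsed = time.perf_counter() - start
    return batch_size * iterations / elapsed

# Hypothetical stand-in for one training step; a real harness would run
# the framework's optimizer step here instead of sleeping.
def dummy_step():
    time.sleep(0.001)

print(f"{images_per_second(dummy_step, batch_size=64):.0f} images/s")
```

For the translation model (GNMT), the same formula applies with tokens per batch in place of images per batch.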
NVIDIA Ampere GPU Performance
The deep learning image classification tests show that NVLink's benefit to the NVIDIA RTX A6000 is not obvious. Figure 1 summarizes the overall performance of the Ampere-architecture NVIDIA RTX A6000. The Ampere architecture supports the latest deep learning library, cuDNN 8, and is equipped with third-generation Tensor Cores, which are increasingly well integrated into deep learning frameworks. Beyond the major improvement in mixed-precision (FP16) performance, single-precision operations can now use TF32 by default, which performs slightly better than traditional FP32. Half-precision performance is about 40% to 155% higher than single-precision, varying with model architecture.
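TF32 keeps FP32's 8-bit exponent but shortens the mantissa from 23 bits to 10, which is how it trades a little precision for Tensor Core throughput while remaining a drop-in default for FP32 code. A minimal pure-Python sketch of that rounding behavior (an illustration, not NVIDIA code):

```python
import struct

def round_to_tf32(x: float) -> float:
    """Keep only the 10 most significant mantissa bits of a float32
    value, mimicking the reduced precision of the TF32 format."""
    # Reinterpret the float as its 32-bit pattern.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # FP32 has 23 mantissa bits; TF32 keeps 10, so zero the low 13.
    bits &= ~((1 << 13) - 1)
    return struct.unpack("<f", struct.pack("<I", bits))[0]

print(round_to_tf32(1.5))     # → 1.5  (exactly representable in 10 bits)
print(round_to_tf32(1.0001))  # → 1.0  (below TF32's resolution near 1)
```

Because the exponent range is unchanged, TF32 overflows and underflows exactly where FP32 does, which is why frameworks can enable it by default.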
Figure 1 Performance summary of NVIDIA RTX A6000 deep learning image classification
Figure 1 can be subdivided into a single-GPU vs. multi-GPU comparison (Figure 2) and a comparison of the multi-GPU performance improvement from NVLink (Figure 3).
Two GPUs deliver 86% to 97% higher image classification performance than a single GPU. Beyond classification, the object detection model (SSD) improves by about 95% to 100%, the image segmentation model (Mask R-CNN) by about 85% to 92%, and the translation model (GNMT) by about 72% to 81% (see Table 1).
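The improvement percentages quoted throughout this report are simple throughput ratios. With hypothetical throughput numbers (not taken from the report's measurements), the calculation looks like:

```python
def scaling_gain(single_gpu_ips: float, multi_gpu_ips: float) -> float:
    """Percentage improvement of a multi-GPU run over a single GPU,
    the quantity tabulated in Tables 1 and 2."""
    return 100.0 * (multi_gpu_ips / single_gpu_ips - 1.0)

# Hypothetical example: one GPU at 900 images/s, two GPUs at 1750 images/s.
print(f"{scaling_gain(900, 1750):.0f}%")  # → 94%
```

Perfect linear scaling would be 100%; the 86% to 97% range above reflects synchronization and data-loading overhead in distributed training.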
Table 1 Performance improvement percentage of two Ampere architecture GPUs compared with a single GPU on various models
Figure 2 Performance comparison of NVIDIA RTX A6000 with multi-GPU on deep learning image classification
As for NVLink's influence, the NVIDIA RTX A6000's performance increases by only about 1% to 5% once the NVLink bridge is installed; the overall improvement is not obvious.
This article also compares other types of deep learning models. As shown in Figure 4, NVLink does not significantly improve the object detection model (SSD); the image segmentation model (Mask R-CNN) improves by about 2% to 7%, and the translation model shows a relatively larger gain of about 12% to 15% (see Table 2).
Table 2 Performance improvement of Ampere architecture GPU with NVLink on various models
Figure 3 Performance comparison of NVIDIA RTX A6000 deep learning image classification with NVLink
Figure 4 NVLink performance comparison of Ampere series GPUs in deep learning and other models
Comparison between Turing GPU and Ampere GPU
The Turing-architecture Quadro RTX 6000 has 24GB of GDDR6 RAM, while the Ampere-architecture NVIDIA RTX A6000 has 48GB (refer to Table 3). If two Quadro RTX 6000s are used for distributed deep learning model training, how do they fare against a single RTX A6000? Figure 5 covers the image classification (VGG16 and ResNet50), object detection (SSD), image segmentation (Mask R-CNN), and translation (GNMT) models. As Table 4 shows, a single NVIDIA RTX A6000 outperforms two Quadro RTX 6000s by 6% to 57% on the translation model, while on all the other image recognition models the two Quadro RTX 6000s are about 10% to 20% faster.
Table 3 Comparison of NVIDIA RTX A6000 and Quadro RTX 6000 specifications
                            NVIDIA RTX A6000                                  Quadro RTX 6000
GPU Memory                  48 GB GDDR6                                       24 GB GDDR6
Deep Learning Performance   77.4 Tensor TFLOPS (TF32) / 154.9 Tensor TFLOPS   130.5 Tensor TFLOPS
Table 4 Performance improvement (%) of a single NVIDIA RTX A6000 over two Quadro RTX 6000s in deep learning
Figure 5 Deep learning performance comparison between NVIDIA RTX A6000 and Quadro RTX 6000
Model Training Comparison
Comparing the two generations of high-end GPUs on single-GPU model training (Figure 6), the Ampere-architecture GPU's single-precision performance (Ampere TF32 vs. Turing FP32) is about 36% to 130% higher than the Turing GPU's (see Table 5), and its half-precision performance is 1% to 46% higher.
Figure 6 Training performance comparison of the two generations of GPUs in various models
Table 5 The average performance improvement percentage of Ampere architecture GPU over Turing-based GPU
Model Inference Comparison
Comparing the Turing- and Ampere-architecture high-end GPUs on single-GPU model inference (Figure 7), the Ampere GPU's single-precision performance (Ampere TF32 vs. Turing FP32) is about 32% to 125% higher (see Table 6), and its half-precision performance is 18% to 33% higher.
Figure 7 Inference performance comparison of the two generations of GPUs in various models
Table 6 The average performance improvement percentage of Ampere architecture GPU over Turing-based GPU