[ENG] RAPIDS Introduction and Benchmark
RAPIDS is NVIDIA's GPU acceleration platform for data science and machine learning. It accelerates computation through CUDA libraries, gives users easy access to the GPU's computing resources, and provides a Python interface that can be used from Jupyter, where data can also be visualized for analysis. Built on popular open source projects such as Apache Arrow, pandas, and scikit-learn, RAPIDS brings GPU power to the most popular Python data science toolchain.
NVIDIA has also worked with many open-source contributors, such as Anaconda, BlazingDB, Databricks, Quansight, and scikit-learn, to bring more machine learning methods to RAPIDS. The creators of Apache Arrow and pandas (the Python data analysis library) are also involved, bringing the Apache Arrow data format into the RAPIDS development environment, which greatly improves the efficiency of data transfer and analysis. NVIDIA is also working to integrate RAPIDS with Apache Spark.
Apache Arrow
Apache Arrow advantage. Image source
Apache Arrow is a standardized in-memory format for exchanging data between cross-platform applications, allowing users to combine data from different systems more efficiently. Traditionally, in-memory data is stored row by row and each system has its own memory format, so a lot of time is wasted on serialization and deserialization whenever data is exchanged. Apache Arrow instead uses a common columnar storage format, which significantly improves the efficiency of both systems and applications.
Apache Arrow storage architecture. Image source
Diagram of Apache Arrow's in-memory data structure. A traditional memory buffer stores data row by row (as shown on the left), whereas Apache Arrow stores data column by column, so that values of the same field are aligned together in one group.
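The difference between the two layouts can be sketched in plain Python. This is only an illustration of the row-versus-column idea, not Apache Arrow's actual format or API; the records, field names, and the `to_columnar` helper are hypothetical.

```python
from array import array

# Hypothetical records used only for illustration (row-oriented layout).
rows = [
    {"id": 1, "price": 9.5, "qty": 3},
    {"id": 2, "price": 4.0, "qty": 10},
    {"id": 3, "price": 7.25, "qty": 1},
]


def to_columnar(records):
    """Regroup row-oriented records into one contiguous typed buffer per field."""
    return {
        "id": array("q", (r["id"] for r in records)),        # 64-bit integers
        "price": array("d", (r["price"] for r in records)),  # 64-bit floats
        "qty": array("q", (r["qty"] for r in records)),      # 64-bit integers
    }


columns = to_columnar(rows)

# Row layout: a column total has to visit every record and pick out one field.
total_from_rows = sum(r["price"] for r in rows)

# Columnar layout: the same total scans a single contiguous buffer.
total_from_columns = sum(columns["price"])

# A column's raw bytes can be handed over and read back without re-encoding
# each value, which is the idea behind avoiding serialization overhead.
price_bytes = columns["price"].tobytes()
restored = array("d")
restored.frombytes(price_bytes)
```

Both totals are the same; what changes is how much unrelated data has to be touched to compute them, and how cheaply a whole column can be passed to another consumer.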
cuDF
GPU DataFrame library
cuDF is a GPU DataFrame library that stores data in DataFrame format for data analysis and computation. It is similar to pandas in Python and can read, filter, and aggregate data, among other operations. cuDF is built on the Apache Arrow memory format, which improves the efficiency of data transfer and GPU operations and makes it easy to convert other data sources (such as pandas, Spark, and CSV) into the cuDF format, or to convert cuDF data to other formats. Through Numba, an open source JIT (just-in-time) compiler, cuDF also supports parallel data processing. Users can apply the wrapper functions provided by cuDF to process data in parallel with little effort, and users experienced in writing GPU kernels can combine cuDF DataFrames with Numba to write their own kernel functions.
A DataFrame is a two-dimensional data structure (as shown above). Various commands let you inspect and use the data you need (for example, data size, length, and number of nulls), and the format can be used for machine learning operations. A cuDF DataFrame is the same as a pandas DataFrame in form and use, but it incurs extra processing time because memory must first be allocated on the GPU. In the default memory management mode, the size of a cuDF DataFrame is limited by GPU memory, so it is usually smaller than what a pandas DataFrame can hold.
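As a minimal sketch of the pandas-like usage described here, the snippet below builds a small DataFrame, filters it, and aggregates it. It assumes cuDF is installed on a machine with a supported NVIDIA GPU and otherwise falls back to pandas, which accepts the same calls for this example. The column names, the data, and the `summarize` helper are hypothetical.

```python
try:
    import cudf as xdf      # GPU DataFrame (needs RAPIDS and an NVIDIA GPU)
except ImportError:
    import pandas as xdf    # CPU fallback; the calls below are the same


def summarize(frame_lib):
    """Filter rows, then sum `value` per `group`, using a pandas-style API."""
    df = frame_lib.DataFrame({
        "group": ["a", "b", "a", "b", "a"],
        "value": [1.0, 2.0, 3.0, 4.0, 5.0],
    })
    filtered = df[df["value"] > 1.0]                  # boolean-mask filter
    totals = filtered.groupby("group")["value"].sum() # aggregate per group
    # A cuDF result lives on the GPU; copy it back to the host before use.
    if hasattr(totals, "to_pandas"):
        totals = totals.to_pandas()
    return totals.to_dict()


totals = summarize(xdf)
```

The result is converted to a plain dictionary because the row order of a group-by result is not assumed to be the same in both libraries.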
cuML
cuML is the machine learning and scientific computing library of RAPIDS. It provides an API similar to scikit-learn, so users can accelerate their work on the GPU without knowing GPU programming. cuML is usually used together with cuDF DataFrames, which speeds up getting data onto the GPU and simplifies the transfer instructions between host and device. cuML algorithms fall into three categories: regression and classification, clustering, and dimensionality reduction. The algorithms are implemented in C++/CUDA and exposed to Python through Cython. Some parts of cuML are still in development. The RAPIDS team also works closely with the DMLC XGBoost organization, and XGBoost is supported in the RAPIDS environment.
This work is co-developed by NVIDIA and Dask, and multi-GPU, multi-node support for machine learning will be implemented later. Most methods are expected to be supported by RAPIDS version 1.0.
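To illustrate what a scikit-learn-like API means in practice, here is a minimal clustering sketch. It assumes cuML is installed and otherwise falls back to scikit-learn, whose `KMeans` takes the same arguments used here. The synthetic data and parameter values are hypothetical, and NumPy input is used for simplicity rather than a cuDF DataFrame.

```python
import numpy as np

try:
    from cuml.cluster import KMeans      # GPU implementation (RAPIDS)
except ImportError:
    from sklearn.cluster import KMeans   # CPU implementation, same calls here

# Two well-separated synthetic blobs (hypothetical data for illustration).
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.1, size=(50, 2))
blob_b = rng.normal(loc=5.0, scale=0.1, size=(50, 2))
X = np.vstack([blob_a, blob_b]).astype(np.float32)

# Same estimator workflow as scikit-learn: construct, fit, read attributes.
model = KMeans(n_clusters=2, random_state=0)
model.fit(X)
labels = np.asarray(model.labels_)
```

Only the import line changes between the CPU and GPU versions; the estimator workflow stays the same.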
Benchmarks - Official Test
Performance comparison between Spark and RAPIDS. The vertical axis is the execution time (seconds). Image source
Comparing the end-to-end process on CPU with the same process on RAPIDS shows that RAPIDS not only reduces model training time but also significantly reduces data conversion time.
Speedup comparison between cuML and sklearn using three methods with different data sizes. Image source
Comparing TSVD, PCA, and DBSCAN training with cuML against sklearn shows that the larger the data size, the greater the speedup. Similarly, the more variables are processed, the greater the speedup.
Benchmarks - Actual Test
Test system specifications: WinFast WS830 (Intel Xeon W-2135, 128GB RAM, GTX 1070 8GB)
Software specifications: Ubuntu 18.04, CUDA 10.0, cuDNN 7.5
Data reading speed comparison between cuDF (RAPIDS) and pandas
(horizontal axis is the data size, and vertical axis is the execution time (seconds))
Performance comparison between GPU and CPU (12 cores) in XGBoost
(horizontal axis is the data size, and vertical axis is the execution time (seconds))
Performance comparison between GPU and CPU (12 cores) in RandomForest
(horizontal axis is the data size, and vertical axis is the execution time (seconds))
Performance comparison between GPU and CPU in PCA
(horizontal axis is the data size, and vertical axis is the execution time (seconds))
Performance comparison between GPU and CPU in K-means
(horizontal axis is the data size, and vertical axis is the execution time (seconds))
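The charts in this section plot execution time against data size. A minimal sketch of how such timings can be collected is shown below. It is not the harness used for these tests: the helper names are hypothetical, and sorting a list is only a stand-in for the operation being measured (for example, a data read or a model fit).

```python
import time


def time_call(fn, *args, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` runs of fn(*args)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best


def benchmark(fn, make_input, sizes, repeats=3):
    """Time fn on inputs of increasing size; returns {size: seconds}."""
    return {n: time_call(fn, make_input(n), repeats=repeats) for n in sizes}


def speedup(baseline_seconds, accelerated_seconds):
    """How many times faster the accelerated run is than the baseline."""
    return baseline_seconds / accelerated_seconds


# Stand-in workload: sort a reversed list of n integers.
timings = benchmark(sorted, lambda n: list(range(n, 0, -1)), sizes=[1_000, 10_000])
```

Running the same `benchmark` call once with the CPU implementation and once with the GPU implementation gives two timing tables, and `speedup` turns each pair of timings into a ratio.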