CUDA GEMM Optimization
Introduction
The CUDA GEMM Optimization project implements high-performance general matrix-matrix multiplication (GEMM) using CUDA, NVIDIA's parallel computing platform. The repository contains CUDA kernel implementations of these matrix operations that maintain correctness across varying matrix sizes. The kernels are tuned for 4096 x 4096 x 4096 matrix multiplication on an NVIDIA GeForce RTX 3090 GPU, but they are expected to work on any NVIDIA GPU with compute capability 7.0 or greater.
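For reference, GEMM computes C = alpha * A * B + beta * C for matrices A (m x k), B (k x n), and C (m x n). The following minimal, unoptimized FP32 kernel is a sketch of that operation, not a kernel from the repository; the kernel name and the row-major storage convention are illustrative assumptions:

// Naive FP32 GEMM sketch: C = alpha * A * B + beta * C, all row-major.
// One thread computes one element of C. Illustrative only; the project's
// kernels layer tiling, vectorization, and other optimizations on top.
__global__ void gemm_naive(size_t m, size_t n, size_t k, float alpha,
                           float const* A, float const* B, float beta,
                           float* C)
{
    size_t const row{blockIdx.y * blockDim.y + threadIdx.y};
    size_t const col{blockIdx.x * blockDim.x + threadIdx.x};
    if (row < m && col < n)
    {
        float sum{0.0f};
        for (size_t i{0}; i < k; ++i)
        {
            sum += A[row * k + i] * B[i * n + col];
        }
        C[row * n + col] = alpha * sum + beta * C[row * n + col];
    }
}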
Usage
The project uses Docker to streamline building and running the CUDA kernels, providing an isolated and consistent environment for development and testing. The Docker image is built on top of the NVIDIA NGC CUDA 12.2.2 container image.
Build Docker Images
To begin using the Docker container, users need to construct a custom Docker image, which can be accomplished with the following terminal command:
$ docker build -f docker/gemm-cuda.Dockerfile --no-cache --tag=gemm-cuda:12.2.2 .
This command builds the Docker image that provides the environment for compiling and running the CUDA kernels.
Run Docker Container
Next, to execute the Docker container with GPU support, users should run:
$ docker run -it --rm --gpus device=0 -v $(pwd):/mnt gemm-cuda:12.2.2
Additional flags such as --cap-add=SYS_ADMIN and --security-opt seccomp=unconfined may be added if profiling with NVIDIA Nsight Compute is needed, as in the example below.
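For instance, combining the base command with the profiling flags above:

$ docker run -it --rm --gpus device=0 --cap-add=SYS_ADMIN --security-opt seccomp=unconfined -v $(pwd):/mnt gemm-cuda:12.2.2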
Build CUDA Kernels
Within the Docker container, the CUDA kernels can be built by executing:
$ cmake -B build
$ cmake --build build --config Release --parallel
$ cmake --install build
These commands configure the build system, compile the kernels in parallel in Release mode, and install the built binaries.
Run CUDA Kernels
To run the floating-point GEMM CUDA kernels for both FP32 and FP16 data types, execute:
$ ./build/src/profile_cuda_gemm_fp32
$ ./build/src/profile_cuda_gemm_fp16
Each executable runs the corresponding GEMM kernels and reports their measured performance.
Performance
The project's performance metrics were obtained on an NVIDIA GeForce RTX 3090 GPU. Results may vary between runs, sometimes by as much as 25%, due to system noise and measurement variance.
FP32 GEMM
For FP32 GEMM, most of the custom kernels do not use NVIDIA Tensor Cores. Performance, measured in TFLOPS, varies across kernel implementations: the cuBLAS reference achieved around 24.6 TFLOPS, while the custom versions ranged from 0.27 to 20.16 TFLOPS depending on optimization strategies such as memory access patterns, tiling, and vectorization.
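As a sketch of one such strategy, the following shared-memory tiled FP32 kernel stages tiles of A and B in shared memory so that each global-memory element is loaded once per tile rather than once per output element. The tile size, kernel name, and row-major layout are illustrative assumptions, not the repository's actual implementation:

#define TILE 32

// Tiled FP32 GEMM sketch: each block computes a TILE x TILE tile of C,
// staging tiles of A and B in shared memory to cut global-memory traffic.
// The bounds checks allow dimensions that are not multiples of TILE.
__global__ void gemm_tiled(size_t m, size_t n, size_t k, float alpha,
                           float const* A, float const* B, float beta,
                           float* C)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    size_t const row{blockIdx.y * TILE + threadIdx.y};
    size_t const col{blockIdx.x * TILE + threadIdx.x};
    float sum{0.0f};

    for (size_t t{0}; t < (k + TILE - 1) / TILE; ++t)
    {
        // Cooperatively load one tile of A and one tile of B.
        size_t const a_col{t * TILE + threadIdx.x};
        size_t const b_row{t * TILE + threadIdx.y};
        As[threadIdx.y][threadIdx.x] =
            (row < m && a_col < k) ? A[row * k + a_col] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (b_row < k && col < n) ? B[b_row * n + col] : 0.0f;
        __syncthreads();

        // Accumulate the partial dot product for this tile.
        for (size_t i{0}; i < TILE; ++i)
        {
            sum += As[threadIdx.y][i] * Bs[i][threadIdx.x];
        }
        __syncthreads();
    }

    if (row < m && col < n)
    {
        C[row * n + col] = alpha * sum + beta * C[row * n + col];
    }
}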
FP16 GEMM
In the FP16 GEMM benchmarks, the more advanced kernels exploited NVIDIA Tensor Cores, yielding a substantial increase in performance: the cuBLAS kernel reached nearly 139 TFLOPS, while the custom implementations ranged from 0.28 TFLOPS to over 55 TFLOPS, showing the gains from techniques such as tiling and matrix transposition.
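As a minimal sketch of the Tensor Core path, the following kernel uses the CUDA WMMA API (nvcuda::wmma) to multiply 16 x 16 x 16 FP16 fragments with FP32 accumulation. It assumes row-major matrices, m, n, and k all multiples of 16, and a launch configuration of one warp per 16 x 16 output tile; it illustrates the API and is not the repository's kernel:

#include <mma.h>

using namespace nvcuda;

// WMMA FP16 GEMM sketch: each warp computes one 16x16 tile of C = A * B on
// Tensor Cores, with FP16 inputs and FP32 accumulation. Assumes row-major
// matrices, m/n/k multiples of 16, and a grid that exactly tiles the
// output (e.g. blockDim = (32, 4), one warp per tile).
__global__ void gemm_wmma(int m, int n, int k, half const* A, half const* B,
                          float* C)
{
    // Warp coordinates in units of 16x16 tiles.
    int const tile_row = blockIdx.y * blockDim.y + threadIdx.y;
    int const tile_col = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
    wmma::fill_fragment(c_frag, 0.0f);

    // March along the K dimension, one 16-wide fragment per iteration.
    for (int i = 0; i < k; i += 16)
    {
        wmma::load_matrix_sync(a_frag, A + tile_row * 16 * k + i, k);
        wmma::load_matrix_sync(b_frag, B + i * n + tile_col * 16, n);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    }

    // Write the accumulated FP32 tile back to global memory.
    wmma::store_matrix_sync(C + tile_row * 16 * n + tile_col * 16, c_frag, n,
                            wmma::mem_row_major);
}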
References
For further in-depth reading and technical details, refer to the article CUDA Matrix Multiplication Optimization, which covers the methodologies and enhancements applied in optimizing the CUDA kernels for GEMM operations.