CUDA-GEMM-Optimization
This project explores methods for improving the performance of general matrix multiplication (GEMM) CUDA kernels on NVIDIA GPUs, with the GeForce RTX 3090 as the specific target. The kernels are compatible with GPUs of compute capability 7.0 or above, and the project is built and run inside the NVIDIA NGC CUDA Docker container. Techniques such as 2D block tiling and vectorized memory access are applied to FP32 and FP16 GEMM, with or without Tensor Cores.
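To make the idea of 2D block tiling concrete, here is a minimal FP32 sketch. It is not taken from this project's kernels: the kernel name `gemm_2d_tiled`, the tile sizes (`BM`, `BN`, `BK`, `TM`, `TN`), and the row-major layout are all assumptions chosen for illustration. Each thread block stages a `BM x BK` tile of A and a `BK x BN` tile of B in shared memory, and each thread accumulates a `TM x TN` sub-tile of C in registers, so every value read from shared memory is reused several times.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical tile sizes, chosen for illustration only.
constexpr int BM = 64, BN = 64, BK = 8;         // tile computed per thread block
constexpr int TM = 4, TN = 4;                   // sub-tile computed per thread
constexpr int THREADS = (BM / TM) * (BN / TN);  // 256 threads per block

// C = A * B with row-major A (M x K), B (K x N), C (M x N).
__global__ void gemm_2d_tiled(int M, int N, int K,
                              const float* A, const float* B, float* C)
{
    __shared__ float As[BM][BK];
    __shared__ float Bs[BK][BN];

    const int tid  = threadIdx.x;
    const int trow = tid / (BN / TN);   // thread's row in the 16 x 16 thread grid
    const int tcol = tid % (BN / TN);   // thread's column in that grid
    const int row0 = blockIdx.y * BM;   // first row of C owned by this block
    const int col0 = blockIdx.x * BN;   // first column of C owned by this block

    float acc[TM][TN] = {};             // per-thread accumulators, zero-initialized

    for (int k0 = 0; k0 < K; k0 += BK) {
        // Cooperative loads into shared memory; out-of-range elements become 0.
        for (int i = tid; i < BM * BK; i += THREADS) {
            const int r = i / BK, c = i % BK;
            const int gr = row0 + r, gc = k0 + c;
            As[r][c] = (gr < M && gc < K) ? A[gr * K + gc] : 0.0f;
        }
        for (int i = tid; i < BK * BN; i += THREADS) {
            const int r = i / BN, c = i % BN;
            const int gr = k0 + r, gc = col0 + c;
            Bs[r][c] = (gr < K && gc < N) ? B[gr * N + gc] : 0.0f;
        }
        __syncthreads();

        // Outer product of a TM-column slice of As and a TN-row slice of Bs.
        for (int k = 0; k < BK; ++k) {
            float a[TM], b[TN];
            for (int i = 0; i < TM; ++i) a[i] = As[trow * TM + i][k];
            for (int j = 0; j < TN; ++j) b[j] = Bs[k][tcol * TN + j];
            for (int i = 0; i < TM; ++i)
                for (int j = 0; j < TN; ++j)
                    acc[i][j] += a[i] * b[j];
        }
        __syncthreads();                // tiles are overwritten in the next step
    }

    for (int i = 0; i < TM; ++i)
        for (int j = 0; j < TN; ++j) {
            const int r = row0 + trow * TM + i, c = col0 + tcol * TN + j;
            if (r < M && c < N) C[r * N + c] = acc[i][j];
        }
}

int main()
{
    const int M = 130, N = 97, K = 75;  // deliberately not multiples of the tiles
    std::vector<float> hA(M * K), hB(K * N), hC(M * N, 0.0f), ref(M * N, 0.0f);
    for (int i = 0; i < M * K; ++i) hA[i] = float(i % 7) - 3.0f;
    for (int i = 0; i < K * N; ++i) hB[i] = float(i % 5) - 2.0f;
    for (int r = 0; r < M; ++r)         // CPU reference result
        for (int c = 0; c < N; ++c)
            for (int k = 0; k < K; ++k)
                ref[r * N + c] += hA[r * K + k] * hB[k * N + c];

    float *dA, *dB, *dC;
    cudaMalloc(&dA, hA.size() * sizeof(float));
    cudaMalloc(&dB, hB.size() * sizeof(float));
    cudaMalloc(&dC, hC.size() * sizeof(float));
    cudaMemcpy(dA, hA.data(), hA.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), hB.size() * sizeof(float), cudaMemcpyHostToDevice);

    const dim3 grid((N + BN - 1) / BN, (M + BM - 1) / BM);
    gemm_2d_tiled<<<grid, THREADS>>>(M, N, K, dA, dB, dC);
    const cudaError_t err = cudaDeviceSynchronize();
    cudaMemcpy(hC.data(), dC, hC.size() * sizeof(float), cudaMemcpyDeviceToHost);

    float max_err = 0.0f;
    for (int i = 0; i < M * N; ++i)
        max_err = std::fmax(max_err, std::fabs(hC[i] - ref[i]));
    std::printf("status: %s, max abs error: %g\n", cudaGetErrorString(err), max_err);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return (err == cudaSuccess && max_err < 1e-3f) ? 0 : 1;
}
```

Assuming the file is saved as `gemm_2d_tiled.cu` (a hypothetical name), it can be compiled with `nvcc -O3 gemm_2d_tiled.cu -o gemm_2d_tiled`. The sketch favors readability over speed: it uses scalar loads and bounds checks everywhere, whereas a tuned kernel would typically add vectorized memory access and choose tile sizes by measurement on the target GPU.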