ThunderKittens: A Framework for Fast Deep Learning Kernels
Overview
ThunderKittens is a framework for writing fast deep learning kernels in CUDA, with support for additional platforms such as Apple MPS and AMD ROCm planned. The project revolves around three main principles:
- Simplicity: ThunderKittens is designed to make kernels easy to write.
- Extensibility: The framework is flexible, allowing users to build and expand beyond its offerings without hindrance.
- Speed: Kernels written in ThunderKittens aim to match or surpass hand-optimized implementations; the framework's FlashAttention-3 implementation serves as a testament to this capability.
Modern GPUs fundamentally want to operate on small tiles of data, typically around 16x16 values. ThunderKittens builds its entire programming model around such tiles, which is what lets it extract high performance from the hardware.
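To make the tile-first model concrete, here is a minimal sketch of what warp-level tile code looks like. The type and function names (rt_bf_1x4, load, exp, store) follow an older ThunderKittens release and may be spelled differently in current versions, so treat this as illustrative rather than definitive:

```cuda
#include "kittens.cuh"
using namespace kittens;

// Illustrative sketch (older-release names, single-warp launch assumed):
// each warp owns a register tile and operates on it as a single unit.
__global__ void tile_demo(const bf16 *in, bf16 *out) {
    rt_bf_1x4<> tile;     // a 16x64 register tile: a 1x4 grid of 16x16 subtiles
    load(tile, in, 64);   // load the tile from global memory (row stride in elements)
    exp(tile, tile);      // elementwise ops apply to the whole tile at once
    store(out, tile, 64); // write it back to global memory
}
```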
Key Features
- Tensor Cores: The framework can call fast tensor core functions, including the asynchronous matrix-multiply instructions on H100 GPUs.
- Shared Memory Management: The framework lays tiles out in shared memory so as to avoid common issues such as bank conflicts (see the sketch after this list).
- Optimized I/O: ThunderKittens hides the latency of loads and stores with techniques like asynchronous copies.
- Distributed Shared Memory: The framework supports distributed shared memory on H100s, letting thread blocks in a cluster exchange data directly rather than going through L2 cache.
- Worker Overlapping: A load-store-compute-finish template makes it easy to overlap workers' computation with their I/O.
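The sketch below touches several of these features at once: a tile is staged through swizzled (bank-conflict-free) shared memory and then fed to the tensor cores. As before, the names (st_bf_1x1, shared_allocator, mma_ABt) come from an older release and are illustrative, not the definitive API:

```cuda
#include "kittens.cuh"
using namespace kittens;

// Illustrative sketch (older-release names, single-warp launch assumed):
// stage a tile through swizzled shared memory, then multiply on tensor cores.
__global__ void features_demo(const bf16 *A, const bf16 *B, float *C) {
    extern __shared__ alignment_dummy __shm[];  // dynamic shared memory
    shared_allocator al((int*)&__shm[0]);
    st_bf_1x1<ducks::st_layout::swizzle> &a_s =
        al.allocate<st_bf_1x1<ducks::st_layout::swizzle>>();

    load(a_s, A, 16);     // global -> shared; swizzling avoids bank conflicts
    __syncthreads();

    rt_bf_1x1<> a_r, b_r; // 16x16 register tiles
    rt_fl_1x1<> c_r;      // fp32 accumulator tile
    load(a_r, a_s);       // shared -> registers
    load(b_r, B, 16);     // global -> registers
    zero(c_r);
    mma_ABt(c_r, a_r, b_r, c_r); // tensor cores compute A @ B^T + C
    store(C, c_r, 16);    // write the fp32 result tile back
}
```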
Implementation Example
Let's look at an example of how ThunderKittens operates: a simple FlashAttention-2 kernel for an RTX 4090. The full source is provided in the ThunderKittens documentation; it achieves approximately 155 TFLOPs, roughly 93% of the card's theoretical maximum, in under 100 lines of code.
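While the full kernel is too long to reproduce here, the sketch below paraphrases its inner loop in ThunderKittens style: one step of the online-softmax attention recurrence over a block of keys and values. Type and function names again follow an older release, and the real kernel adds shared-memory staging, multiple workers per block, and rescaling of the running output, so read this as a sketch of the structure rather than the actual source:

```cuda
#include "kittens.cuh"
using namespace kittens;

// One inner-loop step of a FlashAttention-2-style kernel (illustrative,
// older-release names). q/k are 16x64 row tiles, v is a 16x64 column tile.
__device__ void attend_step(const rt_bf_1x4<> &q_reg, const rt_bf_1x4<> &k_reg,
                            const rt_bf_1x4<ducks::rt_layout::col> &v_reg,
                            rt_fl_1x4<> &o_reg,
                            rt_fl_1x1<>::col_vec &max_vec,
                            rt_fl_1x1<>::col_vec &norm_vec) {
    rt_fl_1x1<> att_block;     // fp32 accumulator for Q K^T
    rt_bf_1x1<> att_block_mma; // bf16 copy for the second matmul
    zero(att_block);
    mma_ABt(att_block, q_reg, k_reg, att_block); // tensor cores: Q @ K^T
    row_max(max_vec, att_block, max_vec);        // running row-wise max
    sub_row(att_block, att_block, max_vec);      // shift for numerical stability
    exp(att_block, att_block);                   // unnormalized probabilities
    row_sum(norm_vec, att_block, norm_vec);      // running softmax denominator
    copy(att_block_mma, att_block);              // fp32 -> bf16 for the matmul
    mma_AB(o_reg, att_block_mma, v_reg, o_reg);  // accumulate P @ V
}
```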
Installation and Setup
Getting Started: ThunderKittens is a header-only library, so setup is minimal: clone the repository and include the kittens.cuh header file.
Requirements:
- CUDA 12.3 or later; earlier versions have bugs that affect ThunderKittens.
- C++20 is used extensively, so an up-to-date host compiler is required.
Installation Steps:
- Install the necessary compilers via apt.
- Set environment variables to point to the correct CUDA version.
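After these steps, a minimal file like the one below is a quick way to confirm the header and toolchain are wired up. The compile flags and include path are placeholders to adjust for your setup, and the tile API names follow an older release:

```cuda
// check.cu: compile-time sanity check for the ThunderKittens setup.
// Build with something like (adjust -arch and the include path as needed):
//   nvcc -std=c++20 -arch=sm_90a -I<path-to-ThunderKittens> check.cu -o check
#include "kittens.cuh"
using namespace kittens;

__global__ void roundtrip(const bf16 *in, bf16 *out) {
    rt_bf_1x1<> tile;     // one 16x16 register tile per warp
    load(tile, in, 16);   // global -> registers
    store(out, tile, 16); // registers -> global
}

int main() { return 0; } // if this compiles, C++20 and the headers are found
```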
Exploring Demos and Kernel Installs
ThunderKittens includes a variety of pre-built kernels in the kernels/ directory. To use them:
- Adjust environment variables as needed.
- Select the kernels you need in the configs.py file.
- Run python setup.py install.
A set of demos illustrates the use of ThunderKittens kernels in applications such as training and LLM inference. Community feedback and contributions, such as new kernels or features, are encouraged.
Testing the Framework
ThunderKittens comes with a comprehensive test suite: running make -j in the tests folder will verify the installation. Note that compiling the full set of test kernels can temporarily consume significant system resources.
ThunderKittens Manual
ThunderKittens is a compact library, but using it well requires some understanding of NVIDIA's programming model. Operations are exposed at several scopes, such as thread, warp, and block, each with different capabilities. The library leans on strong typing: tile types encode their scope, precision, and layout, so operations that would be incorrect, such as mixing incompatible layouts, are rejected at compile time rather than producing wrong results.
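A small sketch of how this typing works in practice (older-release names, so treat it as illustrative): register tiles carry their layout in the type, and operations that need a particular layout only accept that layout, so a mistake fails at compile time instead of silently computing the wrong thing:

```cuda
#include "kittens.cuh"
using namespace kittens;

// Illustrative sketch of layout typing (older-release names).
__global__ void typing_demo(const bf16 *A, const bf16 *B, float *C) {
    rt_bf_1x1<> a;                      // row-layout register tile (default)
    rt_bf_1x1<ducks::rt_layout::col> b; // column-layout register tile
    rt_fl_1x1<> c;                      // fp32 accumulator
    load(a, A, 16);
    load(b, B, 16);
    zero(c);
    mma_AB(c, a, b, c);    // ok: mma_AB expects a column-layout B operand
    // mma_AB(c, a, a, c); // rejected at compile time: a has row layout
    store(C, c, 16);
}
```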
Community and Further Learning
To learn more, see the research papers and blog posts detailing ThunderKittens' development. The project encourages community interaction via the ThunderKittens channel on the GPU Mode Discord, which offers opportunities to participate in ongoing discussions and enhancements. The project aims to expand its capabilities continuously, and contributions are welcome, whether adding support for new hardware, improving existing kernels, or building on the framework in new directions.