ThunderKittens
ThunderKittens is a framework for writing high-performance deep learning kernels in CUDA, with MPS and ROCm support planned. It is designed around simplicity, extensibility, and performance, and is built on tile-level operations that map onto modern GPU architectures. Key features include tensor core utilization, asynchronous copies that hide memory latency, and distributed shared memory for sharing data across thread blocks. It targets CUDA 12.3+ and C++20, is straightforward to integrate, ships with pre-built PyTorch kernels, and is backed by an active developer community.