accelerated-scan
The Accelerated Scan project implements forward and backward associative scans on the GPU for first-order recurrences, such as those that arise in state space models and linear RNNs. Its C++ CUDA kernel processes the sequence in chunks and uses warp shuffles and shared memory to exchange partial results between threads. Implementations are available in both CUDA and Triton, and both aim to speed up the scan while maintaining numerical accuracy. The project's benchmarks report improvements over conventional methods, which makes it an option worth considering for developers who need a fast and accurate associative scan.
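
To make the idea of scanning a first-order recurrence concrete, here is a minimal CPU sketch in plain Python. It assumes the recurrence has the form h[t] = a[t] * h[t-1] + x[t] with h[-1] = 0, and it uses a simple doubling (Hillis-Steele) scan purely for illustration. The names `combine`, `sequential_scan`, `associative_scan`, `gates`, and `tokens` are hypothetical and are not the project's API, and the chunked CUDA and Triton kernels are organized differently.

```python
from typing import List, Sequence, Tuple

Pair = Tuple[float, float]  # (a, b) represents the affine step h -> a * h + b


def combine(left: Pair, right: Pair) -> Pair:
    """Compose two affine steps, applying `left` first and `right` second.

    This operator is associative, which is what lets a scan regroup the work.
    """
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)


def sequential_scan(gates: Sequence[float], tokens: Sequence[float]) -> List[float]:
    """Reference loop: h[t] = gates[t] * h[t-1] + tokens[t], starting from 0."""
    h = 0.0
    out = []
    for a, x in zip(gates, tokens):
        h = a * h + x
        out.append(h)
    return out


def associative_scan(gates: Sequence[float], tokens: Sequence[float]) -> List[float]:
    """Inclusive scan by doubling: about log2(n) rounds of independent combines.

    Within a round every position can be updated independently, which is the
    property a GPU implementation exploits. Here the rounds run serially.
    """
    pairs: List[Pair] = list(zip(gates, tokens))
    n = len(pairs)
    step = 1
    while step < n:
        # Build a new list so each round reads only the previous round's values.
        pairs = [
            combine(pairs[i - step], pairs[i]) if i >= step else pairs[i]
            for i in range(n)
        ]
        step *= 2
    # With a zero initial state, h[t] is the offset of the composed prefix.
    return [b for _, b in pairs]
```

Calling `associative_scan(gates, tokens)` and `sequential_scan(gates, tokens)` on the same inputs should agree up to floating-point rounding, since the two only differ in how the multiplications and additions are grouped.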