Project Introduction: FasterTransformer
In the realm of natural language processing (NLP), FasterTransformer stands out as a significant contribution from NVIDIA. The project provides highly optimized implementations of the transformer encoder and decoder, the components at the heart of language model inference. Although active development has since moved to a newer project, TensorRT-LLM, FasterTransformer remains a substantial piece of engineering built for efficiency and speed.
Model Overview
Transformers are a cornerstone of modern NLP architectures, used for tasks such as translation and summarization. FasterTransformer accelerates these tasks by optimizing the transformer layers for inference. The optimizations are most effective on NVIDIA Volta, Turing, and Ampere GPUs, where Tensor Cores are exploited automatically when the data and weights use FP16 precision.
Support for FasterTransformer extends across multiple frameworks, including TensorFlow, PyTorch, and a Triton inference server backend, allowing its features to be integrated into a variety of machine learning environments. This versatility makes it an accessible choice for developers looking to speed up NLP model inference.
Support Matrix
FasterTransformer supports a wide array of models, each with specific capabilities across multiple frameworks:
- BERT and XLNet: These models are supported across TensorFlow, PyTorch, and C++, with some frameworks offering additional features like INT8 precision or sparsity optimizations.
- GPT, BLOOM, and other large language models: supported mainly through PyTorch and the Triton backend, with tensor and pipeline parallelism for scaling across GPUs and nodes (a sizing sketch follows below).
- Vision Transformers: Swin Transformers and Vision Transformers (ViT) leverage TensorRT for improved computation efficiency.
The broad support across models and frameworks highlights the adaptability of FasterTransformer to fit various needs in model inference.
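To make the parallelism terms concrete, here is a minimal sizing sketch. The variable names (world_size, tensor_para_size, pipeline_para_size) mirror settings used in FasterTransformer's multi-GPU GPT example configurations, but the exact configuration mechanism differs between versions, so treat this as an illustration rather than the project's API.

```python
# Minimal sizing sketch; names follow FasterTransformer's GPT example configs (assumed).
world_size = 8                    # total number of GPUs across all nodes
tensor_para_size = 4              # each layer's weight matrices are split across 4 GPUs
pipeline_para_size = world_size // tensor_para_size  # consecutive layer groups span the rest

# The two parallel sizes must tile the full set of GPUs.
assert tensor_para_size * pipeline_para_size == world_size
print(f"tensor parallel: {tensor_para_size}, pipeline parallel: {pipeline_para_size}")
```

Tensor parallelism splits the matrix multiplications inside each layer across GPUs, while pipeline parallelism assigns contiguous groups of layers to different GPUs, so the two sizes multiply together to cover the whole device set.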
Advanced Features
The directory structure of FasterTransformer is comprehensive, separating modules such as CUDA kernels, layer implementations, and model definitions to aid development and deployment. Importantly, it provides custom operations (OPs) for TensorFlow and PyTorch, allowing developers to call the optimized kernels from those frameworks and extend functionality to meet specific needs (a short loading sketch follows).
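As a rough illustration of how the PyTorch custom OPs are typically consumed, the sketch below registers a FasterTransformer shared library with PyTorch. The library path used here (build/lib/libth_transformer.so) is an assumption; the actual file name and location depend on the release and on how the project was built.

```python
import torch

# Assumed path to the compiled FasterTransformer PyTorch op library (hypothetical;
# the real name and location depend on the release and build configuration).
FT_LIB_PATH = "build/lib/libth_transformer.so"

# Registering the library makes the C++/CUDA operators visible to this process;
# the optimized encoder/decoder wrappers then become available as classes under
# torch.classes, as used by the project's bundled PyTorch examples.
torch.classes.load_library(FT_LIB_PATH)
```

The TensorFlow path is analogous: a custom op shared library is built and then loaded with tf.load_op_library before the optimized ops can be used in a graph.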
Global Environment and Debugging
FasterTransformer provides several environment settings that improve testing and debugging; the snippet after this list shows how they are typically set:
- Log Level Control: A log-level setting controls the verbosity of messages, from detailed debug output up to errors only.
- NVTX Profiling: When enabled, NVTX tags are inserted around operations so they appear in profiling tools, aiding performance analysis.
- Debug Level: A debug setting forces device synchronization after kernels, which makes it much easier to pinpoint the failing kernel, at a noticeable performance cost.
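As a small illustration, the snippet below sets these variables before the FasterTransformer library is loaded. The variable names (FT_LOG_LEVEL, FT_NVTX, FT_DEBUG_LEVEL) follow the project's documentation, but accepted values may differ between releases, so treat this as a sketch.

```python
import os

# Set these before loading the FasterTransformer ops; names follow the project's
# documentation, though accepted values can vary between releases (assumption).
os.environ["FT_LOG_LEVEL"] = "DEBUG"    # message verbosity (e.g. DEBUG, INFO, WARNING, ERROR)
os.environ["FT_NVTX"] = "ON"            # insert NVTX ranges for profiling tools
os.environ["FT_DEBUG_LEVEL"] = "DEBUG"  # synchronize after each kernel to localize errors

# ...then load the custom op library and run inference as usual.
```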
Performance
The performance of FasterTransformer is documented through extensive benchmarking. For example, it delivers significant speedups for the BERT base model at both small and large batch sizes, in both TensorFlow and PyTorch. Speedups of up to roughly 6x are reported for configurations that use FP16 precision or INT8 quantization.
- BERT Performance: Substantially outperforms the stock TensorFlow and PyTorch implementations, especially at large batch sizes.
- Decoding Speeds: Shows significant gains in decoding tasks, particularly with the optimized FP16 implementations.
- GPT: Demonstrates that the optimized implementation handles very large models efficiently in configurations representative of high-performance inference.
Release Notes
FasterTransformer has evolved through many versions, each adding capabilities and optimizing previous offerings:
- FP8 Support: FP8 support for BERT and GPT was added recently, marking a major advancement.
- Decoding Optimization: Introduction of techniques such as beam search and sampling has improved translation performance.
- Compatibility Enhancements: Extensive work on compatibility with multi-GPU and multi-node setups ensures scalability of models.
The development history notes numerous enhancements and bug fixes, showing a clear trajectory of improvement aimed at meeting advanced NLP challenges.
In conclusion, while development focus has shifted toward TensorRT-LLM, the improvements and broad framework support offered by FasterTransformer make it a pivotal tool for accelerating transformer-based model inference, delivering significant performance gains for today's demanding NLP workloads.