ByteTransformer: Optimized BERT Transformer Inference on NVIDIA GPUs
Introduction
ByteTransformer is a high-performance library designed specifically for accelerating inference of BERT-like transformers. It focuses on efficiency and easy integration, exposing both Python and C++ APIs, and one of its standout features is a PyTorch plugin that speeds up transformer inference with only a few additional lines of Python code.
ByteTransformer supports both fixed-length and variable-length transformers. It applies end-to-end architectural optimizations, most notably a padding-free algorithm for the core BERT routines: QKV (query, key, value) encoding, softmax, the feed-forward network, activation functions, layer normalization, and multi-head attention.
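The gist of the padding-free idea can be shown with a small, self-contained PyTorch sketch (illustrative only, not ByteTransformer's internal code): pack only the valid tokens of a variable-length batch into a contiguous buffer, run the per-token computation on that buffer, and scatter the results back into the padded layout.

# Minimal sketch of the padding-free idea (illustrative only, not ByteTransformer's API).
import torch

def pack_without_padding(x, seq_lens):
    # x: [batch, max_seq_len, hidden]; seq_lens: [batch] valid length of each sequence
    batch, max_len, hidden = x.shape
    seq_lens = seq_lens.to(x.device)
    mask = torch.arange(max_len, device=x.device)[None, :] < seq_lens[:, None]  # True on real tokens
    valid_idx = mask.view(-1).nonzero(as_tuple=False).squeeze(1)                 # flat positions of real tokens
    packed = x.reshape(-1, hidden)[valid_idx]                                    # [total_valid_tokens, hidden]
    return packed, valid_idx

def unpack_with_padding(packed, valid_idx, batch, max_len):
    hidden = packed.shape[-1]
    out = packed.new_zeros(batch * max_len, hidden)
    out[valid_idx] = packed                         # scatter results back to the padded layout
    return out.view(batch, max_len, hidden)

# Example: three sequences of lengths 5, 2, and 7, padded to length 8.
x = torch.randn(3, 8, 16)
seq_lens = torch.tensor([5, 2, 7])
packed, idx = pack_without_padding(x, seq_lens)
y = unpack_with_padding(packed * 2.0, idx, 3, 8)    # stand-in for the real per-token computation

Because the heavy per-token math only ever sees real tokens, no work is spent on padding positions.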
ByteTransformer is not only a research prototype: it serves production transformer inference systems at ByteDance, and it outperforms other existing transformer implementations for both fixed-length and variable-length inputs. The work behind ByteTransformer was published at IEEE IPDPS 2023.
How to Cite
Users who incorporate ByteTransformer into their work are encouraged to cite the following research paper:
@article{zhai2022bytetransformer,
  title={ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs},
  author={Zhai, Yujia and Jiang, Chengquan and Wang, Leyuan and Jia, Xiaoying and Zhang, Shang and Chen, Zizhong and Liu, Xin and Zhu, Yibo},
  journal={arXiv preprint arXiv:2210.03052},
  year={2022}
}
Performance and Speedup
ByteTransformer has been benchmarked against several other popular tools including PyTorch, TensorFlow, FasterTransformer, and DeepSpeed using an A100 GPU. The comparative performance evaluation demonstrates significant efficiency improvements in execution time across various batch sizes and sequence lengths. For instance:
- With a batch size of 1 and varying sequence lengths, ByteTransformer consistently achieves faster execution compared to alternatives.
- Similarly, with a batch size of 16, ByteTransformer again runs faster than the alternatives across the tested sequence lengths.
This enhanced efficiency translates into reduced computational costs and faster processing times, making ByteTransformer a preferred choice for high-demand inference tasks.
Supported Models
As of now, ByteTransformer specifically supports the standard BERT transformer encoder model, providing a focused optimization path for this widely used architecture.
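For reference, the per-layer structure being optimized is that of a standard BERT encoder layer. The plain-PyTorch sketch below (illustrative, not the library's implementation) shows the routines listed above wired together:

import torch
import torch.nn as nn

class ReferenceBertLayer(nn.Module):
    # Plain-PyTorch stand-in for one standard BERT encoder layer (post-layernorm variant).
    def __init__(self, hidden=768, heads=12, ffn_hidden=3072):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(hidden)
        self.ffn = nn.Sequential(nn.Linear(hidden, ffn_hidden), nn.GELU(), nn.Linear(ffn_hidden, hidden))
        self.norm2 = nn.LayerNorm(hidden)

    def forward(self, x, key_padding_mask=None):
        a, _ = self.attn(x, x, x, key_padding_mask=key_padding_mask)
        x = self.norm1(x + a)                 # residual + layernorm around multi-head attention
        x = self.norm2(x + self.ffn(x))       # residual + layernorm around the feed-forward network
        return x

# 12 heads with head size 64 give the usual BERT-base hidden size of 768.
layer = ReferenceBertLayer()
out = layer(torch.randn(2, 64, 768))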
Environment Requirements
To use ByteTransformer, the following environment settings are necessary:
- CUDA version 11.6
- CMake version 3.13 or higher
- PyTorch version 1.8 or higher
- GPU compute capability required: 7.0 (V100), 7.5 (T4), or 8.0 (A100)
- Python version 3.7 or higher
The library has been tested on systems equipped with A100 GPUs, CUDA 11.6, PyTorch 1.13.0+cu116, and Python 3.9.16.
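A quick way to check an environment against these requirements is to query PyTorch directly (a minimal sketch that only reports what the installed PyTorch build and the visible GPU provide):

import sys
import torch

# Report the toolchain versions seen by this PyTorch build.
print("Python:", sys.version.split()[0])
print("PyTorch:", torch.__version__)
print("CUDA (as built into PyTorch):", torch.version.cuda)

# Compute capability of the first visible GPU, e.g. (8, 0) for A100.
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print("GPU:", torch.cuda.get_device_name(0), f"compute capability {major}.{minor}")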
Building from Source
Building ByteTransformer from source is straightforward: initialize the submodules, create a build directory, configure with CMake, and compile. The example below targets compute capability 8.0 (A100); the architecture flags in the cmake line should be adjusted for V100 (7.0) or T4 (7.5). The commands are as follows:
git submodule update --init
mkdir build && cd build
cmake -DTORCH_CUDA_ARCH_LIST="8.0" -DDataType=FP16 -DBUILD_THS=ON -DCUDAARCHS="80" ..
make
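With -DBUILD_THS=ON, which enables the PyTorch plugin, the build also produces a shared library that can be loaded from Python with a single call. The file name and location below are assumptions about the build layout, and the registered op names are defined by the plugin itself (see bert_transformer_test.py for how they are invoked), so adapt both to your build output:

import torch

# Path is an assumption about where the build places the plugin; adjust to your build tree.
plugin_path = "build/lib/libths_bytetransformer.so"  # hypothetical file name
torch.ops.load_library(plugin_path)

# After loading, the ops registered by the plugin become available under torch.ops;
# their exact names and signatures come from the plugin, as exercised by the Python unit test.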
Getting Started with Unit Tests
Unit Tests in C++
Users can generate test data using the provided Python script. The four positional arguments are, in order, the batch size, sequence length, number of attention heads, and head size; --avg_seqlen sets the average sequence length for variable-length inputs:
cd build
python3 bert_transformer_test.py 16 64 12 64 --avg_seqlen 32 --dtype fp16 --export_data
After generating the test data, users can run a test executable to verify the library:
./bin/bert_transformer_test 16 64 12 64
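For intuition about what those four numbers describe, the sketch below builds a random batch with matching dimensions and a padding mask (it only illustrates the shapes; it does not call into ByteTransformer):

import torch

batch, max_seq_len, heads, head_size = 16, 64, 12, 64
hidden = heads * head_size                    # 12 * 64 = 768, the standard BERT-base hidden size
avg_seq_len = 32

# Random valid lengths, uniform over [1, max_seq_len], whose mean is roughly avg_seq_len.
seq_lens = torch.randint(1, max_seq_len + 1, (batch,))
hidden_states = torch.randn(batch, max_seq_len, hidden, dtype=torch.float16)
mask = torch.arange(max_seq_len)[None, :] < seq_lens[:, None]   # True on real tokens, False on padding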
Unit Tests via the PyTorch Plugin in Python
For those preferring to test via Python directly, the same script can be executed without the --export_data flag:
python3 bert_transformer_test.py 16 64 12 64 --avg_seqlen 32 --dtype fp16
Benchmarking
To assess the performance of ByteTransformer, users can run the following benchmark script:
cd build
../benchmark/bert_bench.sh
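For ad-hoc comparisons against other framework implementations on the same machine, GPU time can be measured with CUDA events. The following generic sketch (not part of ByteTransformer) times any callable:

import torch

def time_gpu(fn, warmup=10, iters=100):
    # Average GPU time of fn() in milliseconds, measured with CUDA events.
    for _ in range(warmup):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

# Example: time a batched matmul as a stand-in for a transformer forward pass.
a = torch.randn(16, 1024, 1024, device="cuda", dtype=torch.float16)
print(f"{time_gpu(lambda: a @ a):.3f} ms per iteration")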
Through its high-performance capabilities and easy integration, ByteTransformer stands as a significant tool for enhancing transformer inference tasks on NVIDIA GPUs.