xFasterTransformer
xFasterTransformer is an advanced solution designed to optimize the performance of large language models (LLMs) on the x86 platform. It is similar in concept to FasterTransformer, which is used on GPU platforms. It is effective in distributed environments, enabling inference on large models across multiple processor sockets and nodes. With support for both C++ and Python APIs, it offers flexible integration options for developers.
Models Overview
As large language models become increasingly prevalent in AI applications, the speed and efficiency of their inference are crucial. xFasterTransformer harnesses the full capabilities of Intel Xeon hardware to deliver high performance and scalability, extending from single-socket setups to multi-socket and multi-node systems.
Model Support Matrix
xFasterTransformer supports a variety of models through both its PyTorch (Python) and C++ interfaces. It can effectively handle:
- ChatGLM series
- GLM4
- Llama series
- Baichuan series
- QWen series
- SecLLM, Opt, and several other advanced AI models
DataType Support List
The library supports multiple data types, including FP16, BF16, INT8, W8A8, INT4, and NF4, as well as mixed formats that combine them, allowing precision to be traded against memory footprint and throughput across different model deployments.
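As a brief sketch of how the data type is selected in practice, the Python loader takes a dtype argument when reading a converted model; the model path below is a placeholder:

```python
import xfastertransformer

# Placeholder path to a model already converted to the xFasterTransformer format.
MODEL_PATH = "/data/llama-2-7b-xft"

# dtype selects the compute/storage precision; "bf16" is shown here, and other
# values from the data type support list follow the same pattern.
model = xfastertransformer.AutoModel.from_pretrained(MODEL_PATH, dtype="bf16")
```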
Documentation and Resources
An extensive range of resources is available to assist users in deploying xFasterTransformer. This includes detailed API documentation and a comprehensive wiki to provide both high-level and technical insights into its functionalities. Examples are provided to ease the integration process for both C++ and Python users.
Installation Options
- From PyPI: Easily installable via pip, making integration straightforward for Python users (a quick import check is sketched after this list).
- Using Docker: This method involves pulling an xFasterTransformer-specific image, facilitating deployment in isolated environments.
- Building from Source: Detailed instructions are provided for those who prefer to compile the tool from source code, with options for manual setup and using CMake or Python setup scripts.
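A minimal sanity check after a pip installation might look like the following; it assumes only that the package has been installed and is importable in the current environment:

```python
# Assumes installation from PyPI, e.g.: pip install xfastertransformer
import xfastertransformer

# AutoModel is the entry point for loading converted models; printing it simply
# confirms that the package is visible to the current Python environment.
print(xfastertransformer.AutoModel)
```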
Models Preparation
Converting models from the Huggingface format to one supported by xFasterTransformer is straightforward. A conversion tool built into xFasterTransformer helps accomplish this while ensuring compatibility with various model types.
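As an illustrative sketch, the converter can be invoked directly from Python; the model family and paths below are placeholders, with LlamaConvert shown for a Llama-style checkpoint (other supported families have corresponding converter classes):

```python
import xfastertransformer as xft

# Convert a Hugging Face checkpoint (placeholder input path) into the format
# that xFasterTransformer loads (placeholder output path).
xft.LlamaConvert().convert("/data/llama-2-7b-hf", "/data/llama-2-7b-xft")
```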
API Usage
Both the Python and C++ APIs follow the style of the Hugging Face transformers library, so they feel familiar to developers. The Python API supports features such as tokenization and streaming output, making it highly valuable for developers looking to integrate large language models into their solutions. The C++ API provides similar functionality and uses SentencePiece for tokenization.
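A minimal Python sketch following the documented usage pattern is shown below; the paths, prompt, and generation settings are placeholders:

```python
import xfastertransformer
from transformers import AutoTokenizer, TextStreamer

# Placeholder paths: the Hugging Face directory supplies the tokenizer,
# the converted directory supplies the weights for xFasterTransformer.
TOKEN_PATH = "/data/llama-2-7b-hf"
MODEL_PATH = "/data/llama-2-7b-xft"

tokenizer = AutoTokenizer.from_pretrained(
    TOKEN_PATH, use_fast=False, padding_side="left", trust_remote_code=True
)
# Streaming output: tokens are printed by the streamer as they are generated.
streamer = TextStreamer(tokenizer, skip_special_tokens=True, skip_prompt=False)

model = xfastertransformer.AutoModel.from_pretrained(MODEL_PATH, dtype="bf16")

input_ids = tokenizer(
    "Once upon a time, there existed a little girl", return_tensors="pt"
).input_ids
generated_ids = model.generate(input_ids, max_length=200, streamer=streamer)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True))
```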
Running xFasterTransformer
xFasterTransformer supports both single-rank and multi-rank (MPI-based) execution, adapting to setups ranging from a single socket to multi-socket and multi-node deployments. The documentation provides guidelines on configuring the runtime environment to optimize performance.
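As a hedged sketch of the multi-rank pattern, assuming the script is launched under MPI and that the loaded model exposes its rank: rank 0 tokenizes the prompt and drives generation, while the remaining ranks simply join each generate call.

```python
# Launched under MPI, e.g. something like: mpirun -n 2 python demo.py
# (exact process counts and binding options depend on the hardware setup).
import xfastertransformer
from transformers import AutoTokenizer

TOKEN_PATH = "/data/llama-2-7b-hf"   # placeholder paths
MODEL_PATH = "/data/llama-2-7b-xft"

model = xfastertransformer.AutoModel.from_pretrained(MODEL_PATH, dtype="bf16")

if model.rank == 0:
    # The master rank prepares the input and drives generation.
    tokenizer = AutoTokenizer.from_pretrained(
        TOKEN_PATH, use_fast=False, trust_remote_code=True
    )
    input_ids = tokenizer(
        "Once upon a time, there existed a little girl", return_tensors="pt"
    ).input_ids
    generated_ids = model.generate(input_ids, max_length=200)
    print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True))
else:
    # Worker ranks only participate in the distributed computation.
    while True:
        model.generate()
```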
Web Demo and Serving
For demonstration purposes, there is a web-based interface built using Gradio, enabling easy interaction with models like ChatGLM and Llama2. Additionally, xFasterTransformer's compatibility with standard serving frameworks such as vLLM, FastChat, and MLServer ensures it can be integrated into a wide range of deployment systems.
Benchmark and Support
Benchmark tools are available for users to quickly assess the performance of model inference tasks, while support is provided via email and an online community on WeChat, fostering an environment for collaborative development and troubleshooting.
Accepted Papers and Research
xFasterTransformer has been featured in accepted academic papers on LLM inference optimization, particularly on CPU architectures. If the project aids in research endeavors, users are encouraged to cite the related papers.
Q&A
Addresses common queries, such as hardware compatibility: xFasterTransformer is optimized for Intel Xeon processors and does not support Intel Core client CPUs, because it relies on instruction set extensions that those CPUs lack.