xFasterTransformer
xFasterTransformer is an advanced solution designed to optimize the performance of large language models (LLMs) on the x86 platform. It is similar in concept to FasterTransformer, which is used on GPU platforms. It is effective in distributed environments, enabling inference on large models across multiple processor sockets and nodes. With support for both C++ and Python APIs, it offers flexible integration options for developers.
Models Overview
As large language models become increasingly prevalent in AI applications, the speed and efficiency of their inference are crucial. xFasterTransformer harnesses the full capabilities of Intel Xeon hardware to deliver high performance and scalability, extending from single-socket setups to multi-socket and multi-node systems.
Model Support Matrix
xFasterTransformer supports a variety of models through both its PyTorch (Python) and C++ interfaces. It can effectively handle:
- ChatGLM series
- GLM4
- Llama series
- Baichuan series
- QWen series
- SecLLM, Opt, and several other advanced AI models
DataType Support List
The library supports multiple data types, including FP16, BF16, INT8, W8A8, INT4, and NF4, as well as mixed formats that combine them, allowing precision to be traded against memory footprint and throughput across different model deployments.
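As a brief sketch of how the data type is selected in practice, the Python loader takes a dtype argument when reading a converted model; the model path below is a placeholder:

```python
import xfastertransformer

# Placeholder path to a model already converted to the xFasterTransformer format.
MODEL_PATH = "/data/llama-2-7b-xft"

# dtype selects the compute/storage precision; "bf16" is shown here, and other
# values from the data type support list follow the same pattern.
model = xfastertransformer.AutoModel.from_pretrained(MODEL_PATH, dtype="bf16")
```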
Documentation and Resources
An extensive range of resources is available to assist users in deploying xFasterTransformer. This includes detailed API documentation and a comprehensive wiki to provide both high-level and technical insights into its functionalities. Examples are provided to ease the integration process for both C++ and Python users.
Installation Options
- From PyPI: Easily installable via pip, making integration straightforward for Python users (a quick import check is sketched after this list).
- Using Docker: This method involves pulling an xFasterTransformer-specific image, facilitating deployment in isolated environments.
- Building from Source: Detailed instructions are provided for those who prefer to compile the tool from source code, with options for manual setup and using CMake or Python setup scripts.
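A minimal sanity check after a pip installation might look like the following; it assumes only that the package has been installed and is importable in the current environment:

```python
# Assumes installation from PyPI, e.g.: pip install xfastertransformer
import xfastertransformer

# AutoModel is the entry point for loading converted models; printing it simply
# confirms that the package is visible to the current Python environment.
print(xfastertransformer.AutoModel)
```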
Models Preparation
Converting models from the Huggingface format to one supported by xFasterTransformer is straightforward. A conversion tool built into xFasterTransformer helps accomplish this while ensuring compatibility with various model types.
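As an illustrative sketch, the converter can be invoked directly from Python; the model family and paths below are placeholders, with LlamaConvert shown for a Llama-style checkpoint (other supported families have corresponding converter classes):

```python
import xfastertransformer as xft

# Convert a Hugging Face checkpoint (placeholder input path) into the format
# that xFasterTransformer loads (placeholder output path).
xft.LlamaConvert().convert("/data/llama-2-7b-hf", "/data/llama-2-7b-xft")
```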
API Usage
Both the Python and C++ APIs follow the style of the Hugging Face transformers library, so they feel familiar to developers. The Python API supports features such as tokenization and streaming output, making it highly valuable for developers looking to integrate large language models into their solutions. The C++ API provides similar functionality and uses SentencePiece for tokenization.
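A minimal Python sketch following the documented usage pattern is shown below; the paths, prompt, and generation settings are placeholders:

```python
import xfastertransformer
from transformers import AutoTokenizer, TextStreamer

# Placeholder paths: the Hugging Face directory supplies the tokenizer,
# the converted directory supplies the weights for xFasterTransformer.
TOKEN_PATH = "/data/llama-2-7b-hf"
MODEL_PATH = "/data/llama-2-7b-xft"

tokenizer = AutoTokenizer.from_pretrained(
    TOKEN_PATH, use_fast=False, padding_side="left", trust_remote_code=True
)
# Streaming output: tokens are printed by the streamer as they are generated.
streamer = TextStreamer(tokenizer, skip_special_tokens=True, skip_prompt=False)

model = xfastertransformer.AutoModel.from_pretrained(MODEL_PATH, dtype="bf16")

input_ids = tokenizer(
    "Once upon a time, there existed a little girl", return_tensors="pt"
).input_ids
generated_ids = model.generate(input_ids, max_length=200, streamer=streamer)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True))
```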
Running xFasterTransformer
xFasterTransformer supports both single-rank and multi-rank (MPI-based) execution, adapting to setups ranging from a single socket to multi-socket and multi-node deployments. The documentation provides guidelines on configuring the runtime environment to optimize performance.
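As a hedged sketch of the multi-rank pattern, assuming the script is launched under MPI and that the loaded model exposes its rank: rank 0 tokenizes the prompt and drives generation, while the remaining ranks simply join each generate call.

```python
# Launched under MPI, e.g. something like: mpirun -n 2 python demo.py
# (exact process counts and binding options depend on the hardware setup).
import xfastertransformer
from transformers import AutoTokenizer

TOKEN_PATH = "/data/llama-2-7b-hf"   # placeholder paths
MODEL_PATH = "/data/llama-2-7b-xft"

model = xfastertransformer.AutoModel.from_pretrained(MODEL_PATH, dtype="bf16")

if model.rank == 0:
    # The master rank prepares the input and drives generation.
    tokenizer = AutoTokenizer.from_pretrained(
        TOKEN_PATH, use_fast=False, trust_remote_code=True
    )
    input_ids = tokenizer(
        "Once upon a time, there existed a little girl", return_tensors="pt"
    ).input_ids
    generated_ids = model.generate(input_ids, max_length=200)
    print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True))
else:
    # Worker ranks only participate in the distributed computation.
    while True:
        model.generate()
```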
Web Demo and Serving
For demonstration purposes, there is a web-based interface built using Gradio, enabling easy interaction with models like ChatGLM and Llama2. Additionally, xFasterTransformer's compatibility with standard serving frameworks such as vLLM, FastChat, and MLServer ensures it can be integrated into a wide range of deployment systems.
Benchmark and Support
Benchmark tools are available for users to quickly assess the performance of model inference tasks, while support is provided via email and an online community on WeChat, fostering an environment for collaborative development and troubleshooting.
Accepted Papers and Research
xFasterTransformer has been featured in accepted academic papers on LLM inference optimization, particularly on CPU architectures. If the project aids in research endeavors, users are encouraged to cite the related papers.
Q&A
Addresses common queries, such as hardware compatibility: xFasterTransformer is optimized for Intel Xeon processors and does not support Intel Core client CPUs, because it relies on instruction set extensions that those CPUs lack.