Introducing fastllm
fastllm is a high-performance inference library for large language models, designed for deployment across multiple platforms. It is implemented entirely in C++ with no third-party dependencies, which keeps it efficient, flexible, and easy to port to other platforms, including Android.
Key Features
- Pure C++ Implementation: fastllm is written in C++, allowing for seamless cross-platform integration and straightforward compilation for Android devices.
- High Speed: It delivers fast inference on ARM, x86, and NVIDIA GPU platforms.
- Model Support: The library can read original Hugging Face models and perform direct quantization.
- OpenAI API Server Deployment: fastllm supports deployment as an OpenAI API server.
- Multi-GPU Deployment: The library can split a model across multiple GPUs and also supports hybrid CPU/GPU deployment for better performance.
- Dynamic Batch and Streaming Output: It can handle dynamic batch sizes and provides streaming output capabilities.
- Modular Architecture: fastllm has a front-end and back-end separated design, which facilitates support for new computational devices.
- Model Compatibility: It currently supports various models such as ChatGLM, Qwen, LLAMA (including ALPACA, VICUNA), BAICHUAN, MOSS, Minicpm, and more.
- Customizable: Users can define custom model structures using Python.
Getting Started
Compilation
fastllm is easy to compile using CMake, but it requires pre-installation of gcc, g++, make, and cmake. For GPU support, CUDA must be pre-installed.
Use the following command to compile:
bash install.sh -DUSE_CUDA=ON # Compiles with GPU support
# bash install.sh -DUSE_CUDA=ON -DCUDA_ARCH=89 # Specify CUDA architecture, e.g., for 4090 use architecture 89
# bash install.sh # Compiles only the CPU version
Running Demo Programs in Python
Assume your model is located in the directory "~/Qwen2-7B-Instruct/". After compilation, several demos can be executed:
# OpenAI API server setup
# Requires: pip install -r requirements-server.txt
# Starts a server named "qwen" on port 8080
python3 -m ftllm.server -t 16 -p ~/Qwen2-7B-Instruct/ --port 8080 --model_name qwen
# Chat with float16 precision
python3 -m ftllm.chat -t 16 -p ~/Qwen2-7B-Instruct/
# Online quantized INT8 model chat
python3 -m ftllm.chat -t 16 -p ~/Qwen2-7B-Instruct/ --dtype int8
# Simple web UI
# Requires: pip install streamlit-chat
python3 -m ftllm.webui -t 16 -p ~/Qwen2-7B-Instruct/ --port 8080
For more details on command options, use the --help parameter or refer to the parameter documentation.
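Once the server demo above is running, any OpenAI-compatible client should be able to talk to it. The following is a minimal sketch using the openai Python package, assuming the server exposes the standard /v1/chat/completions endpoint and uses the model name "qwen" configured above; adjust host, port, and API key handling to your setup.
from openai import OpenAI
# Point the client at the local fastllm server (the API key value is a placeholder)
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="none")
completion = client.chat.completions.create(
    model="qwen",  # must match the --model_name passed when starting the server
    messages=[{"role": "user", "content": "Hello"}],
)
print(completion.choices[0].message.content)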
Running in C++
After navigating to the fastllm build directory:
# Command-line chat program with typewriter effect
./main -p ~/Qwen2-7B-Instruct/
# Simple web UI with streaming output and dynamic batch support, allowing for multi-threaded access
./webui -p ~/Qwen2-7B-Instruct/ --port 1234
On Windows, building with the CMake GUI and Visual Studio is recommended. If problems occur, especially on Windows, refer to the FAQ documentation.
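If you prefer the command line to the CMake GUI, a build along these lines may work from a Visual Studio developer prompt. This is only a sketch, assuming Visual Studio 2022 and the same USE_CUDA option as the Linux build; the project's own Windows notes take precedence.
mkdir build
cd build
cmake .. -G "Visual Studio 17 2022" -A x64 -DUSE_CUDA=ON
cmake --build . --config Release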
Python API
# Model creation
from ftllm import llm
model = llm.model("~/Qwen2-7B-Instruct/")
# Generate a response
print(model.response("Hello"))
# Stream response generation
for response in model.stream_response("Hello"):
    print(response, flush=True, end="")
The package also exposes runtime settings such as the number of CPU threads. For detailed API information, refer to the ftllm API documentation.
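For example, the CPU thread count can be set globally before a model is created. The snippet below is only a sketch: the setter name used here (llm.set_cpu_threads) is an assumption, so verify the exact call against the ftllm API documentation.
from ftllm import llm
# Assumed API: module-level setter for the CPU thread count (check the ftllm docs for the exact name)
llm.set_cpu_threads(16)
model = llm.model("~/Qwen2-7B-Instruct/")
print(model.response("Hello"))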
Multi-Device Deployment
fastllm supports deploying a model across multiple devices, including specifying per-device allocation ratios.
On the Python Command Line
# Set device usage with --device parameter
# --device cuda:1 # Single device setting
# --device "['cuda:0', 'cuda:1']" # Distribute model equally across devices
# --device "{'cuda:0': 10, 'cuda:1': 5, 'cpu': 1}" # Set proportional deployment across devices
Within ftllm
from ftllm import llm
# Configure before model creation
llm.set_device_map("cuda:0") # Single device deployment
llm.set_device_map(["cuda:0", "cuda:1"]) # Equal distribution across devices
llm.set_device_map({"cuda:0": 10, "cuda:1": 5, "cpu": 1}) # Proportional distribution
C++ Multi-Device Usage
// Configure before model creation
fastllm::SetDeviceMap({{"cuda:0", 10}, {"cuda:1", 5}, {"cpu", 1}}); // Proportional distribution across devices
Docker Compilation and Execution
For Docker execution, NVIDIA Runtime must be installed locally, and the default runtime should be set to NVIDIA.
- Install nvidia-container-runtime:
  sudo apt-get install nvidia-container-runtime
- Set the Docker default runtime to NVIDIA by editing /etc/docker/daemon.json (see the example after this list).
- Download converted models into the models directory.
- Compile and launch the web UI:
  DOCKER_BUILDKIT=0 docker compose up -d --build
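A typical /etc/docker/daemon.json for the default-runtime step looks like the following; the runtime path shown is the usual install location of nvidia-container-runtime and may differ on your system:
{
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "default-runtime": "nvidia"
}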
Running on Android
Compilation
To compile for Android, download the NDK tool. Alternatively, compilation can be done on the device itself using Termux with cmake and gcc.
mkdir build-android
cd build-android
export NDK=<your_ndk_directory>
cmake -DCMAKE_TOOLCHAIN_FILE=$NDK/build/cmake/android.toolchain.cmake -DANDROID_ABI=arm64-v8a -DANDROID_PLATFORM=android-23 -DCMAKE_CXX_FLAGS=-march=armv8.2a+dotprod ..
make -j
Execution
- Install Termux on the Android device.
- Run termux-setup-storage to enable file access.
- Transfer the compiled main file and the model files to the device.
- Grant execution permission with chmod 777 main.
- Execute the main file (see the sketch below).
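A possible run sequence inside Termux, assuming the binary and the model were copied to the device's shared Download folder (all paths here are illustrative):
# Copy the binary into the Termux home directory, where execution is permitted
cp ~/storage/downloads/main ~/
cd ~
chmod 777 main
./main -p ~/storage/downloads/Qwen2-7B-Instruct/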
With its pure C++ core, Python tooling, and multi-platform support, fastllm offers a practical path from Hugging Face checkpoints to fast inference on servers, desktops, and mobile devices.