Introducing fastllm
fastllm is a high-performance inference library for large language models, designed for deployment across multiple platforms. It is implemented entirely in C++ with no third-party dependencies, which keeps it efficient, flexible, and easy to port to other platforms, including Android.
Key Features
- Pure C++ Implementation: fastllm is written in C++, allowing for seamless cross-platform integration and straightforward compilation for Android devices.
- High Speed: It delivers fast inference on ARM, x86, and NVIDIA GPU platforms.
- Model Support: The library can read original Hugging Face models and perform direct quantization.
- OpenAI API Server Deployment: fastllm supports deployment as an OpenAI API server.
- Multi-GPU Deployment: The library can split a model across multiple GPUs and also supports hybrid CPU/GPU deployment for better performance.
- Dynamic Batch and Streaming Output: It can handle dynamic batch sizes and provides streaming output capabilities.
- Modular Architecture: fastllm has a front-end and back-end separated design, which facilitates support for new computational devices.
- Model Compatibility: It currently supports various models such as ChatGLM, Qwen, LLAMA (including ALPACA, VICUNA), BAICHUAN, MOSS, Minicpm, and more.
- Customizable: Users can define custom model structures using Python.
Getting Started
Compilation
fastllm is easy to compile using CMake, but it requires pre-installation of gcc, g++, make, and cmake. For GPU support, CUDA must be pre-installed.
Use the following command to compile:
bash install.sh -DUSE_CUDA=ON # Compiles with GPU support
# bash install.sh -DUSE_CUDA=ON -DCUDA_ARCH=89 # Specify CUDA architecture, e.g., for 4090 use architecture 89
# bash install.sh # Compiles only the CPU version
Running Demo Programs in Python
Assume your model is located in the directory "~/Qwen2-7B-Instruct/". After compilation, several demos can be executed:
# OpenAI API server setup
# Requires: pip install -r requirements-server.txt
# Starts a server named "qwen" on port 8080
python3 -m ftllm.server -t 16 -p ~/Qwen2-7B-Instruct/ --port 8080 --model_name qwen
# Chat with float16 precision
python3 -m ftllm.chat -t 16 -p ~/Qwen2-7B-Instruct/
# Online quantized INT8 model chat
python3 -m ftllm.chat -t 16 -p ~/Qwen2-7B-Instruct/ --dtype int8
# Simple web UI
# Requires: pip install streamlit-chat
python3 -m ftllm.webui -t 16 -p ~/Qwen2-7B-Instruct/ --port 8080
For more details on command options, use the --help parameter or refer to the parameter documentation.
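Once the server demo above is running, any OpenAI-compatible client should be able to talk to it. The following is a minimal sketch using the openai Python package, assuming the server exposes the standard /v1/chat/completions endpoint and uses the model name "qwen" configured above; adjust host, port, and API key handling to your setup.
from openai import OpenAI
# Point the client at the local fastllm server (the API key value is a placeholder)
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="none")
completion = client.chat.completions.create(
    model="qwen",  # must match the --model_name passed when starting the server
    messages=[{"role": "user", "content": "Hello"}],
)
print(completion.choices[0].message.content)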
Running in C++
After navigating to the fastllm build directory:
# Command-line chat program with typewriter effect
./main -p ~/Qwen2-7B-Instruct/
# Simple web UI with streaming output and dynamic batch support, allowing for multi-threaded access
./webui -p ~/Qwen2-7B-Instruct/ --port 1234
On Windows, building with the CMake GUI and Visual Studio is recommended. If problems occur, especially on Windows, refer to the FAQ documentation.
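If you prefer the command line to the CMake GUI, a build along these lines may work from a Visual Studio developer prompt. This is only a sketch, assuming Visual Studio 2022 and the same USE_CUDA option as the Linux build; the project's own Windows notes take precedence.
mkdir build
cd build
cmake .. -G "Visual Studio 17 2022" -A x64 -DUSE_CUDA=ON
cmake --build . --config Release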
Python API
# Model creation
from ftllm import llm
model = llm.model("~/Qwen2-7B-Instruct/")
# Generate a response
print(model.response("Hello"))
# Stream response generation
for response in model.stream_response("Hello"):
    print(response, flush=True, end="")
The package also exposes runtime settings such as the number of CPU threads. For detailed API information, refer to the ftllm API documentation.
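For example, the CPU thread count can be set globally before a model is created. The snippet below is only a sketch: the setter name used here (llm.set_cpu_threads) is an assumption, so verify the exact call against the ftllm API documentation.
from ftllm import llm
# Assumed API: module-level setter for the CPU thread count (check the ftllm docs for the exact name)
llm.set_cpu_threads(16)
model = llm.model("~/Qwen2-7B-Instruct/")
print(model.response("Hello"))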
Multi-Device Deployment
fastllm supports deploying a model across multiple devices, including specifying per-device allocation ratios.
On the Python Command Line
# Set device usage with --device parameter
# --device cuda:1 # Single device setting
# --device "['cuda:0', 'cuda:1']" # Distribute model equally across devices
# --device "{'cuda:0': 10, 'cuda:1': 5, 'cpu': 1}" # Set proportional deployment across devices
Within ftllm
from ftllm import llm
# Configure before model creation
llm.set_device_map("cuda:0") # Single device deployment
llm.set_device_map(["cuda:0", "cuda:1"]) # Equal distribution across devices
llm.set_device_map({"cuda:0": 10, "cuda:1": 5, "cpu": 1}) # Proportional distribution
C++ Multi-Device Usage
// Configure before model creation
fastllm::SetDeviceMap({{"cuda:0", 10}, {"cuda:1", 5}, {"cpu", 1}}); // Proportional distribution across devices
Docker Compilation and Execution
For Docker execution, NVIDIA Runtime must be installed locally, and the default runtime should be set to NVIDIA.
- Install nvidia-container-runtime:
  sudo apt-get install nvidia-container-runtime
- Set the Docker default runtime to NVIDIA by editing /etc/docker/daemon.json (see the example after this list).
- Download converted models into the models directory.
- Compile and launch the web UI:
  DOCKER_BUILDKIT=0 docker compose up -d --build
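A typical /etc/docker/daemon.json for the default-runtime step looks like the following; the runtime path shown is the usual install location of nvidia-container-runtime and may differ on your system:
{
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "default-runtime": "nvidia"
}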
Running on Android
Compilation
To compile for Android, download the NDK tool. Alternatively, compilation can be done on the device itself using Termux with cmake and gcc.
mkdir build-android
cd build-android
export NDK=<your_ndk_directory>
cmake -DCMAKE_TOOLCHAIN_FILE=$NDK/build/cmake/android.toolchain.cmake -DANDROID_ABI=arm64-v8a -DANDROID_PLATFORM=android-23 -DCMAKE_CXX_FLAGS=-march=armv8.2a+dotprod ..
make -j
Execution
- Install Termux on the Android device.
- Run termux-setup-storage to enable file access.
- Transfer the compiled main file and the model files to the device.
- Grant execution permission with chmod 777 main.
- Execute the main file (see the sketch below).
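A possible run sequence inside Termux, assuming the binary and the model were copied to the device's shared Download folder (all paths here are illustrative):
# Copy the binary into the Termux home directory, where execution is permitted
cp ~/storage/downloads/main ~/
cd ~
chmod 777 main
./main -p ~/storage/downloads/Qwen2-7B-Instruct/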
With its pure C++ core, Python tooling, and multi-platform support, fastllm offers a practical path from Hugging Face checkpoints to fast inference on servers, desktops, and mobile devices.