PPL LLM Serving
Overview
ppl.llm.serving is a core component of the PPL.LLM system for serving various Large Language Models (LLMs). It is built on ppl.nn, a high-performance neural network inference engine, and provides a gRPC-based server with inference support for popular models such as LLaMA.
Prerequisites
To use ppl.llm.serving, you need a Linux system on an x86_64 or arm64 CPU, along with GCC >= 9.4.0, CMake >= 3.18, and Git >= 2.7.0. For CUDA acceleration, CUDA Toolkit >= 11.4 is required, and version 11.6 or later is recommended.
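You can check whether your installed toolchain meets these requirements before building (nvcc is only relevant if you plan to build with CUDA):
gcc --version      # expect >= 9.4.0
cmake --version    # expect >= 3.18
git --version      # expect >= 2.7.0
nvcc --version     # expect CUDA >= 11.4, ideally >= 11.6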
Quick Start
Installing Prerequisites
On Debian or Ubuntu, install the basic build tools with:
apt-get install build-essential cmake git
Cloning Source Code
Once the prerequisites are installed, the next step is to clone the ppl.llm.serving source code from GitHub:
git clone https://github.com/openppl-public/ppl.llm.serving.git
Building from Source
With the source code in place, build it using the following command:
./build.sh -DPPLNN_USE_LLM_CUDA=ON -DPPLNN_CUDA_ENABLE_NCCL=ON -DPPLNN_ENABLE_CUDA_JIT=OFF -DPPLNN_CUDA_ARCHITECTURES="'80;86;87'" -DPPLCOMMON_CUDA_ARCHITECTURES="'80;86;87'"
If using multiple GPU devices, NCCL is crucial for efficient communication.
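NCCL is only needed when a model is served across several GPUs. For a single-GPU setup, it should be possible to build without it; the command below is a sketch under the assumption that the NCCL option accepts OFF in the same way it accepts ON above:
# single-GPU build sketch: NCCL disabled, other options unchanged from the command above
./build.sh -DPPLNN_USE_LLM_CUDA=ON -DPPLNN_CUDA_ENABLE_NCCL=OFF -DPPLNN_ENABLE_CUDA_JIT=OFF -DPPLNN_CUDA_ARCHITECTURES="'80;86;87'" -DPPLCOMMON_CUDA_ARCHITECTURES="'80;86;87'"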
Exporting Models
Next, export the models to be served. Detailed steps for this process can be found in the ppl.pmx repository on GitHub.
Running Server
To run the server, execute:
./ppl-build/ppl_llama_server /path/to/server/config.json
Before starting the server, ensure that the configuration file is correctly set up. The configuration should include paths to the exported models (model_dir), model parameters (model_param_path), and tokenizer files (tokenizer_path).
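The exact schema is defined by the server, but as a rough sketch the file might look like the following, where all paths are placeholders and a real configuration will typically also carry serving options (network address, batching and sampling settings) that are not shown here:
{
  "model_dir": "/path/to/exported/model",
  "model_param_path": "/path/to/model/params.json",
  "tokenizer_path": "/path/to/tokenizer"
}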
Running Client
After the server is up and running, clients can interact with it using a client application. A simple client request looks like:
./ppl-build/client_sample 127.0.0.1:23333
For more information, refer to tools/client_sample.cc.
Benchmarking
To measure performance, in particular requests per second, use:
./ppl-build/client_qps_measure --target=127.0.0.1:23333 --tokenizer=/path/to/tokenizer/path --dataset=tools/samples_1024.json --request_rate=inf
Details can be found in tools/client_qps_measure.cc.
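The command above replays the dataset as fast as possible (--request_rate=inf). To simulate a steady load instead, the variant below is a sketch under the assumption that --request_rate also accepts a numeric requests-per-second value:
# sketch: same benchmark, but with requests issued at a fixed rate
# (assumption: --request_rate accepts a numeric requests-per-second value in addition to inf)
./ppl-build/client_qps_measure --target=127.0.0.1:23333 --tokenizer=/path/to/tokenizer/path --dataset=tools/samples_1024.json --request_rate=16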
Running Inference Offline
For offline inference tasks, use:
./ppl-build/offline_inference /path/to/server/config.json
Guidance can be obtained from tools/offline_inference.cc.
License
The ppl.llm.serving project is distributed under the Apache License, Version 2.0, which permits both personal and commercial use. For more information, see the LICENSE file.