PPL LLM Serving
Overview
ppl.llm.serving is a core component of the PPL.LLM system for serving various Large Language Models (LLMs). It is built on ppl.nn, a high-performance neural network inference engine, and provides a gRPC-based server with inference support for popular models such as LLaMA.
Prerequisites
To use ppl.llm.serving, you need a Linux system on an x86_64 or arm64 CPU, along with GCC >= 9.4.0, CMake >= 3.18, and Git >= 2.7.0. For CUDA acceleration, CUDA Toolkit >= 11.4 is required, and version 11.6 or later is recommended.
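You can check whether your installed toolchain meets these requirements before building (nvcc is only relevant if you plan to build with CUDA):
gcc --version      # expect >= 9.4.0
cmake --version    # expect >= 3.18
git --version      # expect >= 2.7.0
nvcc --version     # expect CUDA >= 11.4, ideally >= 11.6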
Quick Start
Installing Prerequisites
On Debian or Ubuntu, install the basic build tools with:
apt-get install build-essential cmake git
Cloning Source Code
Once the prerequisites are installed, the next step is to clone the ppl.llm.serving source code from GitHub:
git clone https://github.com/openppl-public/ppl.llm.serving.git
Building from Source
With the source code in place, build it using the following command:
./build.sh -DPPLNN_USE_LLM_CUDA=ON -DPPLNN_CUDA_ENABLE_NCCL=ON -DPPLNN_ENABLE_CUDA_JIT=OFF -DPPLNN_CUDA_ARCHITECTURES="'80;86;87'" -DPPLCOMMON_CUDA_ARCHITECTURES="'80;86;87'"
If using multiple GPU devices, NCCL is crucial for efficient communication.
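NCCL is only needed when a model is served across several GPUs. For a single-GPU setup, it should be possible to build without it; the command below is a sketch under the assumption that the NCCL option accepts OFF in the same way it accepts ON above:
# single-GPU build sketch: NCCL disabled, other options unchanged from the command above
./build.sh -DPPLNN_USE_LLM_CUDA=ON -DPPLNN_CUDA_ENABLE_NCCL=OFF -DPPLNN_ENABLE_CUDA_JIT=OFF -DPPLNN_CUDA_ARCHITECTURES="'80;86;87'" -DPPLCOMMON_CUDA_ARCHITECTURES="'80;86;87'"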
Exporting Models
Next, export the models to be served. Detailed steps for this process can be found in the ppl.pmx repository on GitHub.
Running Server
To run the server, execute:
./ppl-build/ppl_llama_server /path/to/server/config.json
Before starting the server, ensure that the configuration file is correctly set up. The configuration should include paths to the exported models (model_dir), model parameters (model_param_path), and tokenizer files (tokenizer_path).
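The exact schema is defined by the server, but as a rough sketch the file might look like the following, where all paths are placeholders and a real configuration will typically also carry serving options (network address, batching and sampling settings) that are not shown here:
{
  "model_dir": "/path/to/exported/model",
  "model_param_path": "/path/to/model/params.json",
  "tokenizer_path": "/path/to/tokenizer"
}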
Running Client
After the server is up and running, clients can interact with it using a client application. A simple client request looks like:
./ppl-build/client_sample 127.0.0.1:23333
For more information, refer to tools/client_sample.cc.
Benchmarking
To measure performance, in particular requests per second, use:
./ppl-build/client_qps_measure --target=127.0.0.1:23333 --tokenizer=/path/to/tokenizer/path --dataset=tools/samples_1024.json --request_rate=inf
Details can be found in tools/client_qps_measure.cc.
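The command above replays the dataset as fast as possible (--request_rate=inf). To simulate a steady load instead, the variant below is a sketch under the assumption that --request_rate also accepts a numeric requests-per-second value:
# sketch: same benchmark, but with requests issued at a fixed rate
# (assumption: --request_rate accepts a numeric requests-per-second value in addition to inf)
./ppl-build/client_qps_measure --target=127.0.0.1:23333 --tokenizer=/path/to/tokenizer/path --dataset=tools/samples_1024.json --request_rate=16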
Running Inference Offline
For offline inference tasks, use:
./ppl-build/offline_inference /path/to/server/config.json
Guidance can be obtained from tools/offline_inference.cc.
License
The ppl.llm.serving project is distributed under the Apache License, Version 2.0, which permits both personal and commercial use. For more information, see the LICENSE file.