Paddler - Optimized Slot-Based Load Balancing for llama.cpp Servers

Introduction to Paddler

Paddler is an open-source, stateful load balancer and reverse proxy built specifically for servers running llama.cpp. It is intended for real-world production use.

Why Paddler?

Traditional load-balancing strategies such as round-robin are not effective for llama.cpp servers. These servers use continuous batching and a configurable number of slots to handle multiple requests at once. Each slot is a portion of server memory reserved for processing a single request. Paddler tracks how many slots are available on each server and routes every incoming request to a server that has an open slot.
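The difference between round-robin and slot-aware routing can be illustrated with a minimal Python sketch. It is not Paddler's implementation: the Upstream class, the assign_request function, and the "most idle slots wins" policy are assumptions made for illustration only.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass
class Upstream:
    """One llama.cpp server as a balancer might model it (hypothetical)."""
    name: str
    slots_idle: int
    slots_processing: int = 0


def assign_request(upstreams):
    """Route one request to the upstream with the most idle slots.

    Returns the chosen upstream, or None when every slot is busy.
    """
    candidates = [u for u in upstreams if u.slots_idle > 0]
    if not candidates:
        return None
    target = max(candidates, key=lambda u: u.slots_idle)
    # The request now occupies one slot on the chosen server.
    target.slots_idle -= 1
    target.slots_processing += 1
    return target


servers = [
    Upstream("a", slots_idle=0, slots_processing=4),
    Upstream("b", slots_idle=3),
    Upstream("c", slots_idle=1),
]

# Round-robin ignores slot state and starts with the saturated server "a".
print("round-robin:", next(cycle(servers)).name)
# Slot-aware routing skips "a" and picks "b", which has the most idle slots.
print("slot-aware:", assign_request(servers).name)
```

When assign_request returns None, no slot is free anywhere, and the request has to wait or be rejected.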

Key Features

Slot Monitoring: Paddler employs agents to keep a close watch on the activity and status of each llama.cpp server's slots.
Dynamic Scalability: It supports the addition and removal of server instances, making it compatible with autoscaling setups.
Request Buffering: Paddler can hold incoming requests while no servers are available, which makes it possible to scale up from zero hosts.
Integrated Dashboard: Paddler provides a built-in dashboard and supports the StatsD protocol for monitoring server status and performance.
AWS Integration: Paddler includes integration for deployments running on AWS.
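The request buffering feature listed above can be pictured as a bounded queue with a waiting-time limit. The sketch below is a simplified assumption of how such a buffer might behave, not Paddler's actual code. RequestBuffer, max_size, and max_wait_seconds are hypothetical names.

```python
import queue
import time


class RequestBuffer:
    """Holds requests while no slot is free (hypothetical, simplified)."""

    def __init__(self, max_size, max_wait_seconds):
        self._queue = queue.Queue(maxsize=max_size)
        self._max_wait = max_wait_seconds

    def hold(self, request, now=None):
        """Buffer a request; return False if the buffer is already full."""
        now = time.monotonic() if now is None else now
        try:
            self._queue.put_nowait((now, request))
            return True
        except queue.Full:
            return False

    def release(self, idle_slots, now=None):
        """Return up to idle_slots buffered requests that have not timed out."""
        now = time.monotonic() if now is None else now
        ready = []
        while len(ready) < idle_slots:
            try:
                enqueued_at, request = self._queue.get_nowait()
            except queue.Empty:
                break
            if now - enqueued_at <= self._max_wait:
                ready.append(request)
            # Expired requests are dropped here; a real proxy would
            # answer them with an error instead of discarding them.
        return ready
```

While the cluster scales up from zero hosts, requests accumulate through hold. Once a server reports idle slots, release hands back as many requests as there are slots.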

How Does it Work?

Each llama.cpp server must be monitored by a Paddler agent. The agent queries its server for slot status and reports the result to the Paddler load balancer. This runs as a continuous cycle: agents keep checking server status, and the load balancer routes requests according to the slot availability they report.
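One iteration of that agent cycle might look like the following Python sketch. It rests on several assumptions: the slots response is treated as a JSON array of objects with a boolean "is_processing" field (the field name may differ between llama.cpp versions), and the report layout, count_slots, and run_agent_once are hypothetical. Network access is injected as plain callables so the logic stays testable.

```python
import json


def count_slots(slots_payload):
    """Count idle and processing slots in a parsed slots response.

    Assumption: each slot is an object with a boolean "is_processing" field.
    """
    processing = sum(1 for slot in slots_payload if slot.get("is_processing"))
    return {
        "slots_idle": len(slots_payload) - processing,
        "slots_processing": processing,
    }


def run_agent_once(agent_name, fetch_slots, send_report):
    """One cycle: observe the llama.cpp server, then report to the balancer."""
    try:
        status = count_slots(fetch_slots())
        status["healthy"] = True
    except OSError:
        # The server could not be reached; report it as unavailable.
        status = {"slots_idle": 0, "slots_processing": 0, "healthy": False}
    report = {"agent": agent_name, **status}
    send_report(report)
    return report


# Example with a canned response instead of a real HTTP call.
sample = json.loads('[{"id": 0, "is_processing": true}, {"id": 1, "is_processing": false}]')
print(run_agent_once("agent-1", lambda: sample, print))
```

In a real agent, fetch_slots would query the local llama.cpp server, send_report would contact the balancer's management address, and the function would run on a timer.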

Getting Started

Installation

Paddler can be installed on Linux, Mac, or Windows by downloading the latest version from its releases page. On Linux, renaming the downloaded executable to /usr/bin/paddler makes it available system-wide.

Running Components

llama.cpp: Start the server with the --slots flag so that the slots endpoint is enabled.
Agents: Run an agent on the same machine as each llama.cpp server to monitor its slots and report their status.

Start an agent with the agent command, passing the llama.cpp server address and the management address:

./paddler agent \
    --external-llamacpp-host 127.0.0.1 \
    --external-llamacpp-port 8088 \
    --local-llamacpp-host 127.0.0.1 \
    --local-llamacpp-port 8088 \
    --management-host 127.0.0.1 \
    --management-port 8085

Load Balancer: The balancer collects the data reported by the agents and acts as a reverse proxy for outside traffic.

./paddler balancer \
    --management-host 127.0.0.1 \
    --management-port 8085 \
    --reverseproxy-host 192.168.2.10 \
    --reverseproxy-port 8080
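
With the balancer running, clients send requests to the reverse proxy address instead of to individual llama.cpp servers. The Python sketch below builds such a request without sending it. It assumes the balancer forwards the path and body unchanged to llama.cpp's /completion endpoint. The build_completion_request helper and the host and port values are placeholders for whatever was passed to --reverseproxy-host and --reverseproxy-port.

```python
import json
import urllib.request


def build_completion_request(proxy_host, proxy_port, prompt, n_predict=64):
    """Build (but do not send) a completion request aimed at the reverse proxy."""
    body = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{proxy_host}:{proxy_port}/completion",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_completion_request("127.0.0.1", 8080, "Hello")
print(request.get_method(), request.full_url)

# To send it against a running balancer:
# with urllib.request.urlopen(request, timeout=30) as response:
#     print(json.load(response))
```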

Advanced Features

Dashboard and Metrics: Enable the built-in dashboard and StatsD reporting for monitoring, and use the AWS integration for IP management.
Host Header Rewriting and API Keys: Rewrite the Host header of proxied requests and configure API keys where the deployment requires them.

Extra Functionalities

Aggregated Health Status: Combines health data from all llama.cpp instances into a single view of server readiness.
Buffered Requests: Holds requests temporarily, giving the infrastructure time to adapt to changes in server availability.
State Dashboard: Offers real-time insights into the operational status of your cluster.
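
The aggregated health status described above can be sketched as a reduction over per-instance reports. This is an illustrative assumption rather than Paddler's actual response format. The aggregate_health function, the field names, and the status strings are hypothetical.

```python
def aggregate_health(instances):
    """Summarize per-instance reports into one cluster-level status.

    Each report is a dict with "healthy", "slots_idle", and
    "slots_processing" keys (hypothetical field names).
    """
    healthy = [i for i in instances if i["healthy"]]
    slots_idle = sum(i["slots_idle"] for i in healthy)
    slots_processing = sum(i["slots_processing"] for i in healthy)

    if not healthy:
        status = "no healthy instances"
    elif slots_idle == 0:
        status = "no slot available"
    else:
        status = "ok"

    return {
        "status": status,
        "instances_total": len(instances),
        "instances_healthy": len(healthy),
        "slots_idle": slots_idle,
        "slots_processing": slots_processing,
    }


cluster = [
    {"healthy": True, "slots_idle": 2, "slots_processing": 2},
    {"healthy": False, "slots_idle": 0, "slots_processing": 0},
]
print(aggregate_health(cluster))
```

A summary of this kind lets an autoscaler or monitor distinguish "every slot is busy" from "no server is reachable", which call for different reactions.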

Why the Name Paddler?

The name comes from an early plan to implement the Raft consensus algorithm, playing on the idea of "paddling" on a raft. Although the algorithm was never integrated, the name stuck and became part of the project's identity.

Community and Support

Users are encouraged to join the discussion and follow project updates on Paddler's Discord channel.