Introduction to Aphrodite-Engine
Overview
Aphrodite is the official backend engine for PygmalionAI, designed to bring language models to life with speed and efficiency. It acts as the primary inference endpoint for the PygmalionAI website, allowing users to run Hugging Face-compatible models quickly and seamlessly. The backbone of Aphrodite's speed is vLLM's PagedAttention, which propels it to outstanding performance.
Latest News & Updates
In September 2024, Aphrodite released version 0.6.1, introducing support for loading FP16 models in the FP2 to FP7 quant formats. This advancement improves both throughput and memory efficiency. The earlier version 0.6.0 brought major speed improvements and introduced several new quantization formats, such as FP8 and llm-compressor, along with features like asymmetric tensor parallelism and pipeline parallelism.
Key Features
- Continuous Batching: Ensures seamless processing of requests, enhancing the user experience.
- Efficient K/V Management: Utilizes PagedAttention from vLLM for optimal performance.
- Optimized CUDA Kernels: Improves inference speeds significantly.
- Quantization Support: Offers a wide range of quantization formats, including AQLM, AWQ, and more, to reduce memory usage.
- Distributed Inference: Facilitates scalable usage and deployment.
- 8-bit KV Cache: Enhances context length support and throughput, with formats like FP8 E5M2 and E4M3.
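The memory savings from an 8-bit KV cache are easy to estimate with a back-of-the-envelope sketch. The model dimensions below (32 layers, 8 KV heads, head dimension 128) are illustrative, Llama-3.1-8B-like values, not figures taken from Aphrodite itself:

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem):
    """Bytes of KV cache per token: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

# Llama-3.1-8B-like geometry (illustrative values)
fp16 = kv_cache_bytes_per_token(32, 8, 128, 2)  # 16-bit cache
fp8 = kv_cache_bytes_per_token(32, 8, 128, 1)   # 8-bit cache (e.g. FP8 E4M3)

print(fp16)  # 131072 bytes (~128 KiB) per token
print(fp8)   # 65536 bytes (~64 KiB) per token
```

Halving the bytes per element doubles the number of tokens that fit in a fixed KV budget, which is why the 8-bit cache extends usable context length.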
Quickstart Guide
Getting started with Aphrodite is straightforward. Install the engine with:
pip install -U aphrodite-engine
Launching a model is just as easy:
aphrodite run meta-llama/Meta-Llama-3.1-8B-Instruct
This command starts an OpenAI-compatible API server. It can be used with any UI that supports the OpenAI API, such as SillyTavern.
More detailed usage instructions and options are available in the official documentation.
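Once the server is running, any OpenAI-style client can talk to it. Below is a minimal sketch using only the Python standard library; it assumes the default port 2242 from the Docker example and the model launched above, and the endpoint path follows the OpenAI chat-completions convention:

```python
import json

# Base URL for a locally running Aphrodite server (assumed default port 2242).
BASE_URL = "http://localhost:2242/v1"

def build_chat_request(model, user_message, max_tokens=128):
    """Build an OpenAI-style chat-completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }

payload = build_chat_request("meta-llama/Meta-Llama-3.1-8B-Instruct", "Hello!")
print(json.dumps(payload, indent=2))

# To actually send the request (requires the server to be running):
# import urllib.request
# req = urllib.request.Request(
#     f"{BASE_URL}/chat/completions",
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json"},
# )
# print(urllib.request.urlopen(req).read().decode())
```

The same payload works with the official openai Python client by pointing its base URL at the local server.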
Docker Deployment
Aphrodite can also be deployed using Docker, allowing for easier setup:
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 2242:2242 \
--ipc=host \
alpindale/aphrodite-openai:latest \
--model NousResearch/Meta-Llama-3.1-8B-Instruct \
--tensor-parallel-size 8 \
--api-keys "sk-empty"
This command downloads the latest Aphrodite Engine image and launches it with the specified model, sharded across eight GPUs (adjust --tensor-parallel-size to match your GPU count).
Requirements & Compatibility
Aphrodite requires a Linux operating system or Windows with WSL, supporting Python versions 3.8 to 3.12. For maximum performance, CUDA version 11 or later is required. The engine supports a broad range of hardware, including NVIDIA GPUs, AMD GPUs, Intel CPUs and GPUs, Google TPU, and AWS Inferentia.
Notes for Users
- By default, Aphrodite uses up to 90% of a GPU's VRAM. If less memory usage is needed, adjust this with the --gpu-memory-utilization option.
- A comprehensive list of commands and options is available via aphrodite run --help.
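The effect of --gpu-memory-utilization can be sketched with simple arithmetic. The numbers below are hypothetical (a 24 GB GPU holding roughly 16 GB of model weights); they only illustrate how the cap divides VRAM between weights and KV cache:

```python
def kv_budget_gb(total_vram_gb, utilization, weights_gb):
    """VRAM left for the KV cache after the engine reserves its share
    of total VRAM and loads the model weights (illustrative model)."""
    reserved = total_vram_gb * utilization
    return reserved - weights_gb

print(kv_budget_gb(24, 0.9, 16))  # ~5.6 GB left for KV cache at the default 90%
print(kv_budget_gb(24, 0.6, 16))  # negative: the cap is below the weights,
                                  # so the model would not even load
```

Lowering the utilization frees VRAM for other processes, but at the cost of KV cache space, and therefore maximum context and batch size.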
Acknowledgements
Aphrodite Engine is made possible by leveraging numerous open-source projects, including vLLM, TensorRT-LLM, Flash Attention, llama.cpp, and many others that have contributed exceptional technology and developments.
How to Contribute
The Aphrodite project is open to contributions. Enthusiasts and developers can support the project by submitting Pull Requests for new features, bug fixes, or improvements to user experience.
This introduction outlines Aphrodite Engine's functionality, setup, and requirements, and its role in making language models more accessible and performant; the official documentation covers each topic in greater depth.