RayLLM: Simplifying LLM Deployment with Ray Serve
Introduction
RayLLM is a project aimed at streamlining the setup and deployment of large language models (LLMs) with Ray Serve. Although the project initially focused on building a separate library, its deployment examples are now integrated directly into Ray Serve's documentation, which reduces complexity for users and eliminates the need to learn an additional library.
What is RayLLM?
RayLLM, originally known as Aviary, is a solution for serving and managing various open-source LLMs efficiently. It's designed on top of Ray Serve, offering users a powerful and streamlined experience. Key features of RayLLM include:
- A wide array of pre-configured open-source LLMs.
- Compatibility with Transformer models from Hugging Face Hub or local storage.
- Simplified deployment and integration processes for multiple and new LLMs.
- Unique autoscaling capabilities, including scaling down to zero when no demand is present.
- Full support for multi-GPU and multi-node deployments.
- Advanced performance features like continuous batching, model quantization, and streaming.
- An OpenAI-compatible REST API, facilitating easy migration and cross-testing.
- Support for multiple LLM backends such as vLLM and TensorRT-LLM.
Deployment and Use
Local Deployment
For local deployment, the official Docker image anyscale/ray-llm is recommended; manual installation is not supported because of RayLLM's specific dependencies. A basic deployment can be started with:
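# $cache_dir must be set on the host before running the container; pointing it at the local
# Hugging Face cache is a common choice (an assumption here, adjust to your setup):
cache_dir=${XDG_CACHE_HOME:-$HOME/.cache}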
docker run -it --gpus all --shm-size 1g -p 8000:8000 -e HF_HOME=~/data -v $cache_dir:~/data anyscale/ray-llm:latest bash
# Inside docker container
serve run ~/serve_configs/amazon--LightGPT.yaml
Deployment on a Ray Cluster
RayLLM also supports deployment on Ray Clusters, particularly on AWS, provided that AWS credentials are exported and configured correctly:
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
export AWS_SESSION_TOKEN=...
Once credentials are configured, users can start the cluster, connect to its head node, and deploy models with the same serve command used locally, as sketched below.
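The sketch below uses standard Ray CLI commands; the cluster configuration file name is a placeholder, and the serve config is the same example used for local deployment.
# Provision the AWS cluster described in a Ray cluster YAML (placeholder file name)
ray up <cluster-config>.yaml
# Open a shell on the cluster's head node
ray attach <cluster-config>.yaml
# From the head node, deploy a model exactly as in the local case
serve run ~/serve_configs/amazon--LightGPT.yaml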
Kubernetes Deployment
RayLLM supports deployments on Kubernetes via KubeRay, with guidance available in the project's documentation.
Querying Models
Once models are deployed, clients outside the Docker container can query the backend. This can be done using:
- curl, for command-line interaction (see the example after this list).
- Python scripts using the requests library, for integrating deployments into applications.
- The OpenAI SDK, by pointing it at RayLLM's OpenAI-compatible API.
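For example, a minimal curl request against the OpenAI-compatible chat completions endpoint could look like the sketch below; it assumes the server listens on localhost:8000 and that amazon/LightGPT is the deployed model ID.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "amazon/LightGPT",
        "messages": [{"role": "user", "content": "What is Ray Serve?"}],
        "temperature": 0.7
      }'
The same request can be sent from Python with requests.post, or by pointing the OpenAI SDK's base URL at the RayLLM endpoint.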
Installation and CLI Usage
RayLLM can be installed from its GitHub repository, with optional frontend dependencies for an enhanced interface. The CLI provides commands to manage model deployments, check their status, and shut down applications.
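Because RayLLM applications are regular Ray Serve applications, the generic Ray Serve CLI can be used alongside them; the sketch below shows only these standard Serve commands, not RayLLM-specific ones.
serve status    # report the status of all running Serve applications
serve shutdown  # stop all Serve applications on the cluster (asks for confirmation)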
Adding and Managing Models
Users can add new models by creating configuration files, or deploy multiple models at once by combining their Serve configs into a single unified configuration, as sketched below.
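As a rough sketch, once several model configs have been combined into a single Serve config file (the file name below is hypothetical), the whole set can be deployed in one step.
# combined_models.yaml is a hypothetical unified config containing multiple model applications
serve run ~/serve_configs/combined_models.yaml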
Debugging and Support
Potential deployment issues may arise from incorrect model specifications or inadequate cluster resources. Tools like the Ray Dashboard provide critical insights for troubleshooting, and ample support is available via community forums, Slack, and GitHub for reporting bugs or contributing improvements.
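For a quick first check, the Ray CLI and dashboard can confirm whether the cluster actually has the CPUs and GPUs a model requires; the dashboard port shown is Ray's default and may differ in custom setups.
# Summarize cluster nodes and available CPU/GPU resources
ray status
# The Ray Dashboard is served at http://localhost:8265 by default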
Conclusion
RayLLM represents a significant step forward in LLM deployment, providing a robust platform that balances functionality with simplicity. By building on Ray Serve's capabilities, it offers a streamlined way to deploy and manage multiple language models, whether locally, on cloud clusters, or on Kubernetes.