OpenLLM: Self-Hosting LLMs Made Easy
Overview
OpenLLM is a platform that lets developers run any open-source or custom large language model (LLM), such as Llama 3.2, Qwen2.5, and Phi3, as an OpenAI-compatible API with a single command. It comes with a built-in chat UI, advanced inference backends, and a streamlined workflow for deploying enterprise-grade applications to the cloud with Docker, Kubernetes, and BentoCloud.
Getting Started
To begin exploring OpenLLM, install it and try it out with the following commands:
pip install openllm # or pip3 install openllm
openllm hello
The first command installs OpenLLM; openllm hello then launches a short interactive walkthrough so you can explore it for the first time.
Supported Models
OpenLLM supports a variety of cutting-edge open-source LLMs. You can also integrate custom models by setting up a model repository. Below is a list of some supported models:
Model | Parameters | Quantization | Required GPU RAM | Start a Server |
---|---|---|---|---|
Llama 3.1 | 8B | - | 24G | openllm serve llama3.1:8b |
Llama 3.1 | 8B | AWQ 4bit | 12G | openllm serve llama3.1:8b-4bit |
Llama 3.1 | 70B | AWQ 4bit | 80G | openllm serve llama3.1:70b-4bit |
... | ... | ... | ... | ... |
For a comprehensive list of supported models, you can visit the OpenLLM models repository.
Starting an LLM Server
Launching an LLM server locally is straightforward: use the openllm serve command followed by the model version. Note that OpenLLM does not host model weights itself, so a Hugging Face token may be required for some gated models.
openllm serve llama3:8b
This command starts the server at http://localhost:3000, where it exposes an OpenAI-compatible API. You can interact with the model from any tool or framework that supports these APIs.
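For example, you can query the running server with curl. The first command lists the model IDs the server exposes and the second sends a chat completion request; both paths follow the standard OpenAI API, and <model-id> is a placeholder to replace with an ID returned by the first call:
curl http://localhost:3000/v1/models
curl -X POST http://localhost:3000/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "<model-id>", "messages": [{"role": "user", "content": "Explain what OpenLLM does in one sentence."}]}'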
Chat UI
OpenLLM includes a chat interface available at the /chat endpoint of the launched server. This can be accessed by navigating to http://localhost:3000/chat in your browser, providing an interactive way to engage with the models.
Command-Line Interaction
Beyond the chat interface, OpenLLM also lets you interact with a model directly from the command line using the following command:
openllm run llama3:8b
This command opens a dialogue with the specified model, making it easy to chat without a graphical interface.
Model Repository
OpenLLM includes a default model repository containing the latest open-source LLMs, such as Llama 3. You can list all available models with:
openllm model list
To keep your local list of models up to date with the repository, use:
openllm repo update
You can view details of a specific model with:
openllm model get llama3:8b
Adding Models
Developers can contribute new models to the default model repository for others to use. This involves packaging the LLMs as Bentos, BentoML's standard distribution format, and submitting them.
Setting Up a Custom Repository
OpenLLM also supports custom model repositories that follow a set format, which includes a directory called bentos for storing custom models. Developers can use BentoML to build and manage these models.
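As a rough sketch of this workflow, assuming the openllm repo add subcommand provided by recent OpenLLM releases, you can register a custom repository by its Git URL (the repository name and URL below are placeholders), refresh the local index, and confirm that its models appear:
openllm repo add my-repo https://github.com/your-org/your-model-repo # register the repository (placeholder name and URL)
openllm repo update # refresh the local list of available models
openllm model list # models from the custom repository should now be listed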
With OpenLLM, the process of self-hosting powerful LLMs becomes simple and efficient, empowering developers to harness advanced language processing capabilities for diverse applications.