OpenLLM: Self-Hosting LLMs Made Easy
Overview
OpenLLM is a platform that lets developers run any open-source or custom large language model (LLM), such as Llama 3.2, Qwen2.5, and Phi3, as an OpenAI-compatible API with a single command. It comes with a built-in chat UI, advanced inference backends, and a streamlined workflow for deploying enterprise-grade applications to the cloud with Docker, Kubernetes, and BentoCloud.
Getting Started
To begin exploring OpenLLM, install it and try it out with the following commands:
pip install openllm # or pip3 install openllm
openllm hello
The first command installs OpenLLM; openllm hello then launches a short interactive walkthrough so you can explore it for the first time.
Supported Models
OpenLLM supports a variety of cutting-edge open-source LLMs. You can also integrate custom models by setting up a model repository. Below is a list of some supported models:
Model | Parameters | Quantization | Required GPU RAM | Start a Server |
---|---|---|---|---|
Llama 3.1 | 8B | - | 24G | openllm serve llama3.1:8b |
Llama 3.1 | 8B | AWQ 4bit | 12G | openllm serve llama3.1:8b-4bit |
Llama 3.1 | 70B | AWQ 4bit | 80G | openllm serve llama3.1:70b-4bit |
... | ... | ... | ... | ... |
For a comprehensive list of supported models, you can visit the OpenLLM models repository.
Starting an LLM Server
Launching an LLM server locally is straightforward: use the openllm serve command followed by the model version. Note that OpenLLM does not host model weights itself, so a Hugging Face token may be required for some gated models.
openllm serve llama3:8b
This command starts the server at http://localhost:3000, where it exposes an OpenAI-compatible API. You can interact with the model from any tool or framework that supports these APIs.
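For example, you can query the running server with curl. The first command lists the model IDs the server exposes and the second sends a chat completion request; both paths follow the standard OpenAI API, and <model-id> is a placeholder to replace with an ID returned by the first call:
curl http://localhost:3000/v1/models
curl -X POST http://localhost:3000/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "<model-id>", "messages": [{"role": "user", "content": "Explain what OpenLLM does in one sentence."}]}'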
Chat UI
OpenLLM includes a chat interface available at the /chat endpoint of the launched server. This can be accessed by navigating to http://localhost:3000/chat in your browser, providing an interactive way to engage with the models.
Command-Line Interaction
Beyond the chat interface, OpenLLM also lets you interact with a model directly from the command line using the following command:
openllm run llama3:8b
This command opens a dialogue with the specified model, making it easy to chat without a graphical interface.
Model Repository
OpenLLM includes a default model repository containing the latest open-source LLMs, such as Llama 3. You can list all available models with:
openllm model list
To keep your local list of models up to date with the repository, use:
openllm repo update
You can view details of a specific model with:
openllm model get llama3:8b
Adding Models
Developers can contribute new models to the default model repository for others to use. This involves packaging the LLMs as Bentos, BentoML's standard distribution format, and submitting them.
Setting Up a Custom Repository
OpenLLM also supports custom model repositories that follow a set format, which includes a directory called bentos for storing custom models. Developers can use BentoML to build and manage these models.
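As a rough sketch of this workflow, assuming the openllm repo add subcommand provided by recent OpenLLM releases, you can register a custom repository by its Git URL (the repository name and URL below are placeholders), refresh the local index, and confirm that its models appear:
openllm repo add my-repo https://github.com/your-org/your-model-repo # register the repository (placeholder name and URL)
openllm repo update # refresh the local list of available models
openllm model list # models from the custom repository should now be listed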
With OpenLLM, the process of self-hosting powerful LLMs becomes simple and efficient, empowering developers to harness advanced language processing capabilities for diverse applications.