Exploring the llama-api-server Project
Introduction
The llama-api-server offers an efficient way to run Llama as a Service, providing a RESTful API server designed to be compatible with OpenAI's APIs. It builds on open-source backends such as llama.cpp and pyllama, so users can serve their own llama and llama2 models and plug them into the many GPT tools and frameworks that already speak the OpenAI API.
Getting Started
To try out the llama-api-server, users can access an online demo in a Colab notebook, thanks to the efforts of contributors like anythingbutme. This provides a hands-on experience with the API and its functionality.
Model Preparation
Before diving into the API's features, users need to prepare their models:
Using llama.cpp
For users without a quantized llama.cpp model, the project documentation offers guidance on how to prepare one. The instructions are detailed here.
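For orientation, a typical llama.cpp quantization workflow looks roughly like the sketch below; script names and flags change between llama.cpp releases, so treat it as an illustration and defer to the linked instructions:
# Rough sketch only -- script names and flags vary between llama.cpp versions
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make
# Convert the original PyTorch weights to the ggml format
# (older releases ship convert-pth-to-ggml.py instead of convert.py)
python3 convert.py /absolute/path/to/your/7B/
# Quantize to 4-bit, producing the ggml-model-q4_0.bin referenced in config.yml below
./quantize /absolute/path/to/your/7B/ggml-model-f16.bin /absolute/path/to/your/7B/ggml-model-q4_0.bin q4_0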
Using pyllama
Similarly, for those without a quantized pyllama model, clear instructions on preparing one for optimal performance are available here.
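As a very rough illustration (the module path, dataset argument, and flags below are assumptions and may not match your pyllama release; the linked instructions are authoritative), 4-bit quantization with pyllama looks something like:
# Illustrative only -- verify the exact command against the pyllama documentation
pip install pyllama gptq
# Quantize the 7B weights to 4-bit; the output file matches the pyllama-7B4b.pt path used in config.yml below
python -m llama.llama_quant decapoda-research/llama-7b-hf c4 --wbits 4 --save /absolute/path/to/your/pyllama-7B4b.pt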
Installation
Installing llama-api-server is straightforward. Users can download the package from PyPI using pip and then generate configuration files for models and security tokens. Here's a quick outline of the installation process:
pip install llama-api-server
# For usage with pyllama
pip install llama-api-server[pyllama]
# Set up configuration
cat > config.yml << EOF
models:
  completions:
    text-ada-002:
      type: llama_cpp
      params:
        path: /absolute/path/to/your/7B/ggml-model-q4_0.bin
    text-davinci-002:
      type: pyllama_quant
      params:
        path: /absolute/path/to/your/pyllama-7B4b.pt
    text-davinci-003:
      type: pyllama
      params:
        ckpt_dir: /absolute/path/to/your/7B/
        tokenizer_path: /absolute/path/to/your/tokenizer.model
  embeddings:
    text-embedding-davinci-002:
      type: pyllama_quant
      params:
        path: /absolute/path/to/your/pyllama-7B4b.pt
EOF
# Set the token
echo "SOME_TOKEN" > tokens.txt
# Start the server
python -m llama_api_server
# Or make it accessible across the network
python -m llama_api_server --host=0.0.0.0
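With the server running, a quick smoke test against the completions endpoint (assuming the default port 5000 used in the examples below and the token from tokens.txt) confirms that a configured model responds:
# Smoke test: ask the text-ada-002 model defined in config.yml for a completion
curl -X POST http://127.0.0.1:5000/v1/completions -H 'Content-Type: application/json' -H "Authorization: Bearer SOME_TOKEN" -d '{"model": "text-ada-002", "prompt": "hello?"}'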
Calling the API with openai-python
The llama-api-server can be interfaced using the openai-python package. With appropriate environment variables set, such as OPENAI_API_KEY and OPENAI_API_BASE, users can make requests for completions, chat interactions, or embeddings. Here's a sample use case:
export OPENAI_API_KEY=SOME_TOKEN
export OPENAI_API_BASE=http://127.0.0.1:5000/v1
openai api completions.create -e text-ada-002 -p "hello?"
# Or for chat
openai api chat_completions.create -e text-ada-002 -g user "hello?"
# For embedding calls
curl -X POST http://127.0.0.1:5000/v1/embeddings -H 'Content-Type: application/json' -d '{"model":"text-embedding-davinci-002", "input":"It is good."}' -H "Authorization: Bearer SOME_TOKEN"
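Since the routes follow OpenAI's layout, the chat call shown above can also be made with plain curl; a minimal sketch, assuming the standard OpenAI /v1/chat/completions route:
# Sketch: chat request over the OpenAI-compatible route
curl -X POST http://127.0.0.1:5000/v1/chat/completions -H 'Content-Type: application/json' -H "Authorization: Bearer SOME_TOKEN" -d '{"model": "text-ada-002", "messages": [{"role": "user", "content": "hello?"}]}'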
Roadmap and Features
The llama-api-server project is continuously evolving, and its roadmap includes:
- Tested Compatibility: Works smoothly with openai-python and llama-index.
- APIs Supported: Completions, embeddings, and chat, each with numerous adjustable parameters (see the example after this list).
- Backend Support: Compatible with both llama.cpp and pyllama backends with and without quantization, including support for llama2.
- Additional Features: Incorporating performance tuning parameters, token authentication, and continued documentation development.
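As an illustration of the adjustable parameters mentioned in the list above, a completion request can carry the usual OpenAI-style sampling options; which of them each backend honors is documented by the project, so this is a sketch rather than an exhaustive list:
# Sketch: standard OpenAI-style sampling parameters; backend support may vary
curl -X POST http://127.0.0.1:5000/v1/completions -H 'Content-Type: application/json' -H "Authorization: Bearer SOME_TOKEN" -d '{"model": "text-ada-002", "prompt": "hello?", "max_tokens": 32, "temperature": 0.7, "top_p": 0.9}'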
This comprehensive setup allows for a robust integration of AI models into various applications, making the llama-api-server a versatile and valuable tool for developers and researchers alike.