Exploring the llama-api-server Project
Introduction
The llama-api-server offers an efficient way to run Llama as a Service, providing a RESTful API server designed to be compatible with OpenAI's APIs. It builds on open-source backends such as llama.cpp and pyllama, so users can serve their own llama and llama2 models and plug them into the many GPT tools and frameworks that already speak the OpenAI API.
Getting Started
To try out the llama-api-server, users can access an online demo in a Colab notebook, thanks to the efforts of contributors like anythingbutme. This provides a hands-on experience with the API and its functionality.
Model Preparation
Before diving into the API's features, users need to prepare their models:
Using llama.cpp
For users without a quantized llama.cpp model, the project documentation offers guidance on how to prepare one. The instructions are detailed here.
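For orientation, a typical llama.cpp quantization workflow looks roughly like the sketch below; script names and flags change between llama.cpp releases, so treat it as an illustration and defer to the linked instructions:
# Rough sketch only -- script names and flags vary between llama.cpp versions
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make
# Convert the original PyTorch weights to the ggml format
# (older releases ship convert-pth-to-ggml.py instead of convert.py)
python3 convert.py /absolute/path/to/your/7B/
# Quantize to 4-bit, producing the ggml-model-q4_0.bin referenced in config.yml below
./quantize /absolute/path/to/your/7B/ggml-model-f16.bin /absolute/path/to/your/7B/ggml-model-q4_0.bin q4_0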
Using pyllama
Similarly, for those without a quantized pyllama model, clear instructions on preparing one for optimal performance are available here.
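As a very rough illustration (the module path, dataset argument, and flags below are assumptions and may not match your pyllama release; the linked instructions are authoritative), 4-bit quantization with pyllama looks something like:
# Illustrative only -- verify the exact command against the pyllama documentation
pip install pyllama gptq
# Quantize the 7B weights to 4-bit; the output file matches the pyllama-7B4b.pt path used in config.yml below
python -m llama.llama_quant decapoda-research/llama-7b-hf c4 --wbits 4 --save /absolute/path/to/your/pyllama-7B4b.pt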
Installation
Installing llama-api-server is straightforward. Users can download the package from PyPI using pip and then generate configuration files for models and security tokens. Here's a quick outline of the installation process:
pip install llama-api-server
# For usage with pyllama
pip install llama-api-server[pyllama]
# Set up configuration
cat > config.yml << EOF
models:
  completions:
    text-ada-002:
      type: llama_cpp
      params:
        path: /absolute/path/to/your/7B/ggml-model-q4_0.bin
    text-davinci-002:
      type: pyllama_quant
      params:
        path: /absolute/path/to/your/pyllama-7B4b.pt
    text-davinci-003:
      type: pyllama
      params:
        ckpt_dir: /absolute/path/to/your/7B/
        tokenizer_path: /absolute/path/to/your/tokenizer.model
  embeddings:
    text-embedding-davinci-002:
      type: pyllama_quant
      params:
        path: /absolute/path/to/your/pyllama-7B4b.pt
EOF
# Set the token
echo "SOME_TOKEN" > tokens.txt
# Start the server
python -m llama_api_server
# Or make it accessible across the network
python -m llama_api_server --host=0.0.0.0
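With the server running, a quick smoke test against the completions endpoint (assuming the default port 5000 used in the examples below and the token from tokens.txt) confirms that a configured model responds:
# Smoke test: ask the text-ada-002 model defined in config.yml for a completion
curl -X POST http://127.0.0.1:5000/v1/completions -H 'Content-Type: application/json' -H "Authorization: Bearer SOME_TOKEN" -d '{"model": "text-ada-002", "prompt": "hello?"}'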
Calling the API with openai-python
The llama-api-server can be interfaced using the openai-python package. With appropriate environment variables set, such as OPENAI_API_KEY and OPENAI_API_BASE, users can make requests for completions, chat interactions, or embeddings. Here's a sample use case:
export OPENAI_API_KEY=SOME_TOKEN
export OPENAI_API_BASE=http://127.0.0.1:5000/v1
openai api completions.create -e text-ada-002 -p "hello?"
# Or for chat
openai api chat_completions.create -e text-ada-002 -g user "hello?"
# For embedding calls
curl -X POST http://127.0.0.1:5000/v1/embeddings -H 'Content-Type: application/json' -d '{"model":"text-embedding-davinci-002", "input":"It is good."}' -H "Authorization: Bearer SOME_TOKEN"
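Since the routes follow OpenAI's layout, the chat call shown above can also be made with plain curl; a minimal sketch, assuming the standard OpenAI /v1/chat/completions route:
# Sketch: chat request over the OpenAI-compatible route
curl -X POST http://127.0.0.1:5000/v1/chat/completions -H 'Content-Type: application/json' -H "Authorization: Bearer SOME_TOKEN" -d '{"model": "text-ada-002", "messages": [{"role": "user", "content": "hello?"}]}'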
Roadmap and Features
The llama-api-server project is continuously evolving, and its roadmap includes:
- Tested Compatibility: Works smoothly with openai-python and llama-index.
- APIs Supported: Completions, embeddings, and chat, each with numerous adjustable parameters (see the example after this list).
- Backend Support: Compatible with both llama.cpp and pyllama backends with and without quantization, including support for llama2.
- Additional Features: Incorporating performance tuning parameters, token authentication, and continued documentation development.
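As an illustration of the adjustable parameters mentioned in the list above, a completion request can carry the usual OpenAI-style sampling options; which of them each backend honors is documented by the project, so this is a sketch rather than an exhaustive list:
# Sketch: standard OpenAI-style sampling parameters; backend support may vary
curl -X POST http://127.0.0.1:5000/v1/completions -H 'Content-Type: application/json' -H "Authorization: Bearer SOME_TOKEN" -d '{"model": "text-ada-002", "prompt": "hello?", "max_tokens": 32, "temperature": 0.7, "top_p": 0.9}'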
This comprehensive setup allows for a robust integration of AI models into various applications, making the llama-api-server a versatile and valuable tool for developers and researchers alike.