Introduction to Petals Chat
Petals Chat is an interactive chatbot web application that also serves as a gateway for language model inference using the Petals client. Users can interact with advanced language models through the web interface, or integrate the backend APIs into their own applications.
Interactive Chat Features
Petals Chat lets users engage with the chatbot directly through a user-friendly web portal at https://chat.petals.dev. To host their own instance, users can clone the project from GitHub and run it on a personal server, using Flask as the development server or Gunicorn for production-level deployment.
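For self-hosting, the setup typically looks like the following sketch. The repository URL, entry point, and Gunicorn options are assumptions to verify against the project's README:

```shell
# Clone and run a local instance (a sketch; verify package names,
# the app module, and ports against the project's README).
git clone https://github.com/petals-infra/chat.petals.dev.git
cd chat.petals.dev
pip install -r requirements.txt

# Development: Flask's built-in server
flask run --host=0.0.0.0 --port=5000

# Production: Gunicorn with threaded workers for concurrent chat sessions
gunicorn app:app --bind 0.0.0.0:5000 --worker-class gthread --threads 100
```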
Serving Language Models
The project supports various language models, including Llama 2. To serve these models, users need to acquire access to their weights from Meta AI's website or the Hugging Face Model Hub. The set of deployed models can be configured by editing config.py.
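As a rough illustration of such a configuration, the served-model list might look like the sketch below. The actual structure of config.py in the repository may differ; the field names here are assumptions, not the project's real schema:

```python
# Hypothetical sketch of a served-model list for config.py.
# The real file defines its own structure; field names are assumptions.
MODELS = [
    {
        "repository": "meta-llama/Llama-2-70b-chat-hf",  # gated: requires approved access
        "public_api": True,
    },
    {
        "repository": "bigscience/bloomz",
        "public_api": True,
    },
]

def served_repositories(models):
    # Helper for this sketch only: list the repositories that will be served.
    return [m["repository"] for m in models]
```

Removing an entry from such a list would stop the server from loading that model's embeddings, which is the memory-saving lever mentioned in the hosting section below.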
Backend API Endpoints
Petals Chat provides two major API endpoints to developers for integration:
- WebSocket API: the preferred way to connect to the chatbot backend. The WebSocket API (/api/v2/generate) is faster and more efficient, which makes it well suited to applications that need rapid, responsive interaction.
- HTTP API: the HTTP API (/api/v1/...) is also available but is recommended less because of possible throughput limitations. For research and development, developers can use the public endpoint at https://chat.petals.dev/api/... However, hosting a dedicated backend is advised for production to ensure consistent performance.
System Requirements for Hosting
For those looking to run the Petals Chat service on their server, attention needs to be given to the system's capabilities:
- CPU-only server: requires sufficient RAM, particularly if the CPU does not support AVX512, since this affects how embeddings are loaded and therefore memory usage and performance.
- GPU server: must have enough GPU memory to hold the embeddings for all models to be served.

If memory is constrained, removing unneeded models from the config.py file can optimize performance.
Advanced Usage with WebSocket API
For developers who need lower-level control, the WebSocket API allows a client to open a connection and exchange JSON-encoded requests and responses. A client first opens an inference session, then generates model outputs with parameters such as temperature and top_p. Responses can also be streamed token by token, which makes the API a good fit for interactive applications like chatbots.
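The session-then-generate flow can be sketched as a small client. The message field names below (open_inference_session, generate, max_length, max_new_tokens, do_sample) are assumptions about the wire protocol and should be checked against the backend you deploy:

```python
# Sketch of a client for the WebSocket API (/api/v2/generate).
# Message field names are assumptions to verify against your backend.
import asyncio
import json

WS_URL = "wss://chat.petals.dev/api/v2/generate"  # public endpoint; self-host for production

def open_session_msg(model: str, max_length: int) -> str:
    # First message: open an inference session for a given model.
    return json.dumps({"type": "open_inference_session",
                       "model": model, "max_length": max_length})

def generate_msg(inputs: str, max_new_tokens: int = 1,
                 temperature: float = 0.7, top_p: float = 0.9) -> str:
    # Later messages request generation; max_new_tokens=1 streams token by token.
    return json.dumps({"type": "generate", "inputs": inputs,
                       "max_new_tokens": max_new_tokens, "do_sample": 1,
                       "temperature": temperature, "top_p": top_p})

async def chat_once(prompt: str) -> str:
    # Performs a network round trip only when called, not at import time.
    import websockets  # third-party: pip install websockets
    async with websockets.connect(WS_URL) as ws:
        await ws.send(open_session_msg("meta-llama/Llama-2-70b-chat-hf", 512))
        ack = json.loads(await ws.recv())
        if not ack.get("ok"):
            raise RuntimeError(f"failed to open session: {ack}")
        await ws.send(generate_msg(prompt, max_new_tokens=32))
        reply = json.loads(await ws.recv())
        return reply.get("outputs", "")
```

A caller would run, for example, `asyncio.run(chat_once("User: Hi!\nAssistant:"))` against a reachable backend.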
Using the HTTP API
To generate text over HTTP, users can send POST requests with parameters such as the model repository, the input text, and the maximum length of the generated text. Parameters like do_sample, temperature, top_k, and top_p allow fine-tuning of the generation process.
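A minimal sketch of such a request follows. The endpoint path and parameter names mirror the section above; the exact response shape (an `outputs` field is assumed here) should be verified against your deployed backend:

```python
# Sketch of calling the HTTP API (/api/v1/generate); parameter names
# follow the text above, response field names are assumptions.

def build_payload(inputs, model="meta-llama/Llama-2-70b-chat-hf",
                  max_new_tokens=64, do_sample=1, temperature=0.8,
                  top_k=40, top_p=0.9):
    # Form-encoded parameters for the POST request.
    return {"model": model, "inputs": inputs,
            "max_new_tokens": max_new_tokens, "do_sample": do_sample,
            "temperature": temperature, "top_k": top_k, "top_p": top_p}

def generate_via_http(inputs, url="https://chat.petals.dev/api/v1/generate",
                      **kwargs):
    # Performs a network request only when invoked, not at import time.
    import requests  # third-party: pip install requests
    resp = requests.post(url, data=build_payload(inputs, **kwargs))
    resp.raise_for_status()
    return resp.json().get("outputs")
```

For example, `generate_via_http("What is Petals?")` would return the generated continuation; for consistent throughput, point `url` at a self-hosted backend as recommended above.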
Petals Chat offers a versatile way to experience advanced language model capabilities, whether through direct user interaction or as a backend service integrated into other applications. With options for customization and deployment, Petals Chat is designed to assist developers and users in exploring the power of language models in an efficient and scalable manner.