Introducing the LLaMA Cog Template 🦙
The LLaMA Cog Template is a monorepo designed to streamline building and deploying multiple LLaMA models with Cog, a tool for packaging machine learning models into containers. The models, based on Meta's LLaMA architecture, are geared toward research use and offer performance competitive with some leading closed-source models.
What Models Are Included?
This monorepo supports building and deploying several configurations of LLaMA and LLaMA 2 models, including:
- llama-2-13b
- llama-2-13b-chat
- llama-2-13b-transformers
- llama-2-70b
- llama-2-70b-chat
- llama-2-7b
- llama-2-7b-chat
- llama-2-7b-transformers
- llama-2-7b-vllm
These models are available for experimentation at replicate.com/meta.
Experimental Branch Notice
The current version of the template is experimental and depends on 'exllama' for inference. For now, users must clone the 'exllama' repository and check out a specific branch to ensure compatibility, as sketched below; work is underway to make this integration more seamless.
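The repository and branch to use are named in the template's own instructions; the following is only a minimal sketch of that setup, with a placeholder for the branch name:

```bash
# Clone the exllama dependency and switch to the compatibility branch.
# The branch name below is a placeholder — use the one named in the
# template's instructions.
git clone https://github.com/turboderp/exllama
cd exllama
git checkout <compatibility-branch>
```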
The Core of the Template
At its heart, the LLaMA Cog Template packages Meta Research's open-source LLaMA model, which is notable for matching the effectiveness of some proprietary models. The template is designed for cloud deployment on Replicate, using Cog to expose the model through a web interface and an API.
Users can run the LLaMA and LLaMA 2 models in all three sizes (7B, 13B, and 70B) and plug in fine-tuned variants. For the chat models, it's important to supply and adjust the system prompt as needed for accurate results; an example of the expected prompt format follows.
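For the chat variants, LLaMA 2 expects the system prompt wrapped in `<<SYS>>` tags inside the first `[INST]` block. Below is a minimal sketch of that format, assuming the template exposes a single `prompt` input through `cog predict`:

```bash
# Llama 2 chat format: system prompt inside <<SYS>> tags, user message
# inside the [INST] block. The `prompt` input name is an assumption.
cog predict -i prompt="[INST] <<SYS>>
You are a helpful, concise research assistant.
<</SYS>>

Explain what Cog does in one sentence. [/INST]"
```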
Note: the original LLaMA models are licensed for research use only and are not intended for commercial operations; LLaMA 2 ships under Meta's separate community license. Refer to Meta Platforms, Inc.'s official license terms for details.
Prerequisites for Setup
Before getting started, users will need the following:
- LLaMA Weights: The weights are not publicly downloadable; access must be requested through Meta Research.
- GPU Machine: A Linux system with an NVIDIA GPU and the NVIDIA Container Toolkit installed (a quick verification command follows this list).
- Docker: Required for building and deploying models with Cog.
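To confirm that Docker can actually reach the GPU through the NVIDIA Container Toolkit, a one-line smoke test like the following helps (the CUDA image tag here is only an example):

```bash
# If the toolkit is set up correctly, this prints the GPU table from
# nvidia-smi inside a container.
docker run --rm --gpus all nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi
```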
Installation and Configuration Steps
- Install Cog: Begin by installing Cog with its one-line install script (the full command sequence is sketched after this list).
- Set Up Weights: Place the downloaded model weights in an 'unconverted-weights' directory, then convert them to the Hugging Face format with the provided script.
- Tensorize Weights: Serialize the converted weights to improve cold-start times.
- Run the Model Locally: Test the model locally with sample prompts, keeping in mind that the base models complete text rather than answer chat-style questions.
- Create a Model on Replicate: On the Replicate website, create a new model, marking it 'private' if desired.
- Configure GPU Settings: In the model's settings, change the default GPU from T4 to A100 for adequate performance.
- Push the Model to Replicate: After authenticating with 'cog login' (and making sure your user can run Docker), push the model with 'cog push'.
- Deploy and Run the Model: Once pushed, the model is accessible through the Replicate web interface or via the HTTP API for broader application use.
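Here is the whole sequence consolidated into one hedged sketch. The Cog install command and the transformers conversion module are standard; the tensorizing script name, the Replicate model path, and the version hash are placeholders to be replaced with values from your own setup:

```bash
# 1. Install Cog (official install script for Linux/macOS).
sudo curl -o /usr/local/bin/cog -L \
  "https://github.com/replicate/cog/releases/latest/download/cog_$(uname -s)_$(uname -m)"
sudo chmod +x /usr/local/bin/cog

# 2. Convert raw LLaMA weights (placed in ./unconverted-weights) to the
#    Hugging Face format with the transformers conversion script.
cog run python -m transformers.models.llama.convert_llama_weights_to_hf \
  --input_dir unconverted-weights --model_size 7B --output_dir weights

# 3. Tensorize the converted weights for faster cold starts.
#    The script name is a placeholder — use the one shipped with the template.
cog run python tensorize_weights.py

# 4. Smoke-test the model locally with a completion-style prompt.
cog predict -i prompt="Simply put, the theory of relativity states that"

# 5. Authenticate, then push to the model you created on Replicate.
cog login
cog push r8.im/<your-username>/<your-model-name>

# 6. Call the deployed model over the HTTP API; the version hash is a
#    placeholder for the one shown on your model page.
curl -s -X POST https://api.replicate.com/v1/predictions \
  -H "Authorization: Token $REPLICATE_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"version": "<model-version-hash>", "input": {"prompt": "Hello, llama!"}}'
```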
Community and Contributions
The LLaMA Cog Template was developed by Marco Mascorro (@mascobot) with contributions from the broader Cog and Replicate community. Contributors are always welcome, following the guidelines of the all-contributors specification.
In short, the LLaMA Cog Template offers a practical, end-to-end path from raw LLaMA weights to a deployed, API-accessible model, giving researchers a straightforward way to explore the boundaries of language model technologies.