Running Llama 2 and Open-Source LLMs on CPU for Document Q&A
In an age where large language models (LLMs) are being widely adopted, there's a growing demand for self-managed or private model deployments. This need arises from concerns such as data privacy and location-specific data handling regulations. The Llama-2-Open-Source-LLM-CPU-Inference project offers a solution by demonstrating how one can run open-source LLMs directly on CPUs for document question-and-answer tasks. This approach is particularly advantageous as it avoids the high costs associated with GPU usage and allows for local deployment.
Project Overview
Third-party LLMs such as OpenAI's GPT-4 have made powerful language capabilities accessible through simple API calls. However, these services come with downsides, notably usage costs and data-privacy regulations that may rule out sending documents to an external provider. By using open-source LLMs, teams retain control over their data and reduce dependency on external providers. This project provides guidance on running quantized versions of open-source LLMs, such as Llama 2, on local CPU infrastructure for document-based Q&A tasks, using tools like GGML and LangChain.
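As a concrete taste of this setup, here is a minimal sketch, assuming LangChain's C Transformers wrapper and an illustrative model file name, of loading a GGML-quantized Llama 2 model for CPU inference; the project's actual settings live in its config files:

```python
# Minimal sketch of CPU inference with a GGML-quantized Llama 2 model via
# LangChain's C Transformers wrapper. The file name and generation settings
# below are illustrative assumptions, not the project's exact values.
from langchain.llms import CTransformers

llm = CTransformers(
    model="models/llama-2-7b-chat.ggmlv3.q8_0.bin",  # GGML binary from Hugging Face
    model_type="llama",                              # architecture hint for GGML
    config={"max_new_tokens": 256, "temperature": 0.01},
)

print(llm("Summarize the key terms of the sponsorship agreement."))
```

Because the weights are quantized, the whole model fits in ordinary RAM and runs on commodity CPUs, trading some throughput against GPU inference.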
Quickstart Guide
The project offers a simple setup process:
- Download Necessary Files: Download the GGML binary file of the Llama-2-7B-Chat model from the Hugging Face website and place it in the project's `models/` folder.
- Running the Application: Open a terminal in the project directory and execute `poetry run python main.py "<user query>"`, replacing `"<user query>"` with your actual question. For instance: `poetry run python main.py "What is the minimum guarantee payable by Adidas?"`. If you are not using Poetry, omit `poetry run` from the command.
Tools and Technologies
The project relies on the following open-source tools; the sketch after this list shows one way they fit together:
- LangChain: A framework designed to help develop applications powered by language models.
- C Transformers: Python bindings for transformer models implemented in C/C++ using GGML.
- FAISS: An open-source library for efficient similarity search and clustering of dense vectors.
- Sentence-Transformers (all-MiniLM-L6-v2): A pre-trained model that maps text to dense vectors for semantic search and clustering.
- Llama-2-7B-Chat: An open-source Llama 2 variant fine-tuned for dialogue, trained on publicly available datasets and over one million human annotations.
- Poetry: A tool to manage dependencies and Python packaging seamlessly.
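Below is a hedged sketch of how these tools typically compose into a LangChain retrieval-QA pipeline; the index path, model file name, chain type, and `k` value are assumptions rather than the project's exact settings:

```python
# Sketch of a retrieval-QA pipeline combining the tools above; paths and
# parameters are illustrative assumptions, not the project's exact values.
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain.llms import CTransformers

# Sentence-Transformers model that maps text to dense vectors.
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

# Load a FAISS index built beforehand from the documents (see db_build.py
# under Project Structure below).
vectordb = FAISS.load_local("vectorstore/db_faiss", embeddings)

# Quantized Llama 2 chat model served on CPU through C Transformers.
llm = CTransformers(
    model="models/llama-2-7b-chat.ggmlv3.q8_0.bin",
    model_type="llama",
)

# Retrieve the most similar chunks and stuff them into a single prompt.
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectordb.as_retriever(search_kwargs={"k": 2}),
    return_source_documents=True,
)

result = qa({"query": "What is the minimum guarantee payable by Adidas?"})
print(result["result"])
```

The "stuff" chain type simply concatenates the retrieved chunks into the prompt, which suits short report excerpts; much longer contexts would call for map-reduce-style chains.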
Project Structure
The project's repository contains several crucial components:
- /assets: Contains images relevant to the project.
- /config: Configuration files needed for the LLM application.
- /data: Includes datasets like the Manchester United FC 2022 Annual Report in PDF format.
- /models: Stores the binary file of the GGML quantized LLM model.
- /src: Houses the Python scripts for critical components of the LLM application, including `llm.py`, `utils.py`, and `prompts.py`.
- /vectorstore: Directory for the FAISS vector store built from the documents.
- db_build.py: A script that processes the dataset and creates the FAISS vector store (see the sketch after this list).
- main.py: The script responsible for launching the application and accepting user queries.
- pyproject.toml: Declares the project's dependencies and their versions for Poetry.
- requirements.txt: Lists the Python dependencies and their versions for pip-based installation.
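For orientation, here is an illustrative sketch of what a db_build.py-style script does: load the PDF, split it into chunks, embed the chunks, and persist a FAISS index. The PDF file name and chunking parameters are assumptions:

```python
# Illustrative sketch of the vector-store build step; the PDF file name and
# chunking parameters are assumptions, not the project's exact values.
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

# 1. Load the source document from /data (hypothetical file name).
docs = PyPDFLoader("data/annual_report_2022.pdf").load()

# 2. Split into overlapping chunks small enough for retrieval.
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)

# 3. Embed each chunk with all-MiniLM-L6-v2.
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

# 4. Build the FAISS index and save it where main.py can load it.
FAISS.from_documents(chunks, embeddings).save_local("vectorstore/db_faiss")
```

Running this once up front means main.py only has to load the saved index, so each query pays the embedding cost for the question alone rather than for the whole document.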
Additional Resources
For those interested in exploring further or contributing, the project's resources link to the GitHub repositories and documentation for LangChain, C Transformers, and GGML. These references offer deeper insight into the tools and models used and help readers build a fuller picture of how the application works under the hood.