TinyLLM: A Powerful LLM on a Tiny System
Introduction
TinyLLM may sound like a paradoxical name—putting a large language model (LLM) on a small system. Yet, it aptly describes its mission: to run robust LLMs on minimal hardware while maintaining satisfactory performance. This project enables anyone to build a locally hosted LLM with an interface akin to ChatGPT, using consumer-grade hardware.
Key Features
- Multi-Model Support: TinyLLM works with a wide range of locally hosted LLMs.
- Local API Web Service: It can host a local, OpenAI-compatible API using inference servers such as Ollama, llama.cpp, or vLLM.
- Interactive Web Interface: The chatbot interface supports customizable prompts and can pull data from different sources, including websites, databases, and live data such as news, stocks, and weather.
Hardware and Software Requirements
- CPU: Compatible with Intel, AMD, or Apple Silicon.
- Memory: At least 8GB of DDR4 RAM recommended.
- Storage: At least 128GB SSD.
- GPU: An NVIDIA GPU (e.g., GTX 1060 6GB, RTX 3090 24GB) or Apple Silicon (M1/M2).
- Operating System: Ubuntu Linux or macOS.
- Software: Python 3; CUDA 12.2 is required for NVIDIA GPUs.
Getting Started
Although the quick start setup script is currently a placeholder, users can get started manually by cloning the TinyLLM GitHub repository:
git clone https://github.com/jasonacox/TinyLLM.git
cd TinyLLM
Running a Local LLM
To run a local LLM, the project supports three inference servers: vLLM, llama-cpp-python, and Ollama. All three expose OpenAI-compatible APIs, which makes them easy to integrate with other tools (a minimal client sketch follows the list below).
- Ollama Server: Works well across macOS, Linux, and Windows. It exposes an OpenAI-compatible API on top of a llama.cpp engine. Note that it handles one session at a time.
- vLLM Server: Serves multiple concurrent requests with high throughput. It runs non-quantized models, which require GPUs with more VRAM, though AWQ-quantized models reduce the memory footprint.
- llama-cpp-python Server: Optimized for consumer-grade GPUs and uses quantized GGUF models. Like Ollama, it handles one session at a time.
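The sketch below shows how a client might talk to any of these servers through their OpenAI-compatible API. The base URL, API key, and model name are assumptions; substitute the endpoint and model that your chosen server actually exposes.

# Minimal sketch: querying a local OpenAI-compatible server.
# The base URL and model name are assumptions -- adjust them to whatever
# your inference server (Ollama, vLLM, or llama-cpp-python) reports.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed local endpoint
    api_key="sk-no-key-needed",           # local servers typically ignore the key
)

response = client.chat.completions.create(
    model="local-model",  # placeholder; use the model name your server exposes
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain what a GGUF model file is in one sentence."},
    ],
)
print(response.choices[0].message.content)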
Chatbot Application
The TinyLLM Chatbot provides a web interface built with Python FastAPI, allowing users to chat with an LLM through the OpenAI-compatible API of the inference server. It supports multiple sessions and retains conversational history. It also supports RAG (Retrieval Augmented Generation) with features like:
- Summarization of web content and PDFs (a sketch of this pattern follows the list).
- Fetching and summarizing news headlines.
- Providing stock prices and weather updates.
- Integration with vector databases for additional queries.
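To illustrate the summarization side of RAG, the sketch below fetches a web page and asks a locally served model to summarize it. This is not the chatbot's own code; the URL, endpoint, and model name are placeholders.

# Illustrative sketch of the RAG summarization pattern: fetch external
# content, then ask the local LLM to summarize it. URL, endpoint, and
# model name are placeholders, not the chatbot's actual configuration.
import requests
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-no-key-needed")

page = requests.get("https://example.com/article", timeout=10).text  # raw HTML
# Truncate the page so the prompt stays within the model's context window.
prompt = f"Summarize the following web page in three bullet points:\n\n{page[:8000]}"

summary = client.chat.completions.create(
    model="local-model",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
)
print(summary.choices[0].message.content)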
Example Session
Users can start a session by visiting http://localhost:5000, interact with the chatbot, and issue commands such as /news for current headlines or /weather <location> for a local forecast.
Model Recommendations
TinyLLM works with many models from Hugging Face; which ones perform well depends on the inference server in use.
- llama-cpp-python Server: Supports quantized GGUF models, such as Mistral v0.1 (7B) and Llama-2 (7B); a loading sketch follows this list.
- vLLM Server: Compatible with a broader range of models, including larger ones, such as Mistral v0.1 (7B AWQ) and Meta Llama (8B).
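As a rough illustration of loading a GGUF model with the llama-cpp-python bindings, the sketch below assumes a quantized Mistral 7B file has already been downloaded from Hugging Face; the file path and parameter values are placeholders to adjust for your hardware.

# Minimal sketch of loading a quantized GGUF model with llama-cpp-python.
# The model path is a placeholder for a GGUF file downloaded from Hugging Face.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/mistral-7b-instruct-v0.1.Q5_K_M.gguf",  # placeholder path
    n_ctx=4096,       # context window size
    n_gpu_layers=-1,  # offload all layers to the GPU if VRAM allows
)

result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Give one tip for running LLMs on modest hardware."}]
)
print(result["choices"][0]["message"]["content"])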
Additional Tools
The project also highlights utility tools for interacting with LLMs, such as the llm CLI, which provides a convenient way to configure and test local LLM APIs.
References
TinyLLM builds on significant projects such as llama.cpp, llama-cpp-python, and vLLM, which serve as its technical backbone.
By bringing powerful language processing capabilities to more accessible hardware, TinyLLM helps democratize AI technology for developers and enthusiasts interested in LLMs.