ChatLLM.cpp: A Comprehensive Overview
Introduction
ChatLLM.cpp is a project for real-time chatting with models ranging from under 1 billion to over 300 billion parameters. It runs efficiently on the CPU of an ordinary personal computer and is implemented entirely in C++ on top of the ggml library by GitHub user @ggerganov.
Noteworthy Updates
The project is continually evolving. Recent additions include support for Llama 3.2 (September 2024), Qwen 2.5, and several other models such as OLMoE and MiniCPM3, as well as tool calling and an OpenAI-compatible API.
Key Features
- Efficient CPU Inference: int4/int8 quantization and an optimized KV cache keep memory use and computation low (a quantization sketch follows this list).
- OOP Design: Object-oriented programming is used to share code across similar Transformer-based models.
- Streaming Generation: Tokens are emitted as they are generated, giving a typewriter effect.
- Continuous Chatting: Conversations can run indefinitely; when the context fills up, the "Restart" and "Shift" methods keep the chat going (see the second sketch after this list).
- Retrieval Augmented Generation (RAG): Responses can be grounded in retrieved documents during chat.
- LoRA Models: LoRA models are supported and can be merged into a base model during conversion.
- Versatile Bindings: Python, JavaScript, and C bindings enable a variety of applications and web demos.
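To make the quantization feature concrete, here is a minimal NumPy sketch of symmetric int8 quantization of a weight tensor. This is only a conceptual illustration, not the ggml implementation: the real formats (such as q8_0 used below) quantize weights in small blocks, each with its own scale, and int4 works the same way with a narrower range.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Map float weights to int8 in [-127, 127] with a single scale factor."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the int8 values and the scale."""
    return q.astype(np.float32) * scale

w = np.random.randn(8).astype(np.float32)
q, s = quantize_int8(w)
print(w)
print(dequantize_int8(q, s))  # close to w, at roughly a quarter of the memory
```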
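The "Restart" and "Shift" names come from the project; the sketch below is only an assumed illustration of the two usual ways to handle a full context window: restarting clears the history entirely, while shifting discards the oldest turns so the newest ones still fit. The window size and token bookkeeping here are hypothetical.

```python
MAX_TOKENS = 2048  # assumed context-window size, for illustration only

def restart(history):
    """Restart: clear the conversation and begin again from an empty context."""
    return []

def shift(history, new_turn_tokens):
    """Shift: drop the oldest turns until the new turn fits in the window."""
    while sum(len(t) for t in history) + len(new_turn_tokens) > MAX_TOKENS:
        history.pop(0)  # discard the oldest turn
    return history

# Five turns of 500 tokens each overflow a 2048-token window once a
# 300-token turn arrives; shifting keeps only the three newest turns.
history = [[0] * 500 for _ in range(5)]
history = shift(history, [0] * 300)
print(len(history))  # -> 3
```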
Getting Started
Getting started with ChatLLM.cpp is straightforward. A chat session can be launched with:
python chatllm.py -i -m :model_id
Comprehensive instructions are available in the project's quick start guide.
Usage Instructions
Preparation
Begin by cloning the repository using:
git clone --recursive https://github.com/foldl/chatllm.cpp.git && cd chatllm.cpp
If the --recursive option was omitted, run the following to initialize the submodules:
git submodule update --init --recursive
Model Quantization
ChatLLM.cpp supports a variety of models, which are transformed into a quantized format using convert.py. Install the tool's requirements with:
pip install -r requirements.txt
Then, models can be converted as follows:
python3 convert.py -i path/to/model -t q8_0 -o quantized.bin
If merging a LoRA model, use:
python3 convert.py -i path/to/model -l path/to/lora/model -o quantized.bin
Building the Project
ChatLLM.cpp can be built through several methods:
- Using make: On Windows, set up the environment with w64devkit, then execute:
make
- Using CMake: On Linux or Windows, execute:
cmake -B build
cmake --build build -j
Execution
To chat with a quantized model, run commands such as:
./build/bin/main -m chatglm-ggml.bin
For interactive mode, include the -i flag:
rlwrap ./build/bin/main -m model.bin -i
You can explore additional options with ./build/bin/main -h.
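If you want to drive the chat program from another application, the Python, JavaScript, and C bindings mentioned earlier are the supported route. As a rough alternative, the interactive CLI shown above can also be scripted; the snippet below is only a sketch that assumes the binary reads prompts from standard input in interactive mode and exits when the input ends, with the binary and model paths simply mirroring the commands above.

```python
import subprocess

# Rough sketch: launch the interactive CLI, feed one prompt on stdin,
# and print whatever it writes to stdout. Paths follow the steps above.
proc = subprocess.Popen(
    ["./build/bin/main", "-m", "model.bin", "-i"],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    text=True,
)
out, _ = proc.communicate("Hello, who are you?\n", timeout=300)
print(out)
```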
Acknowledgments and Note
The project is an evolution of ChatGLM.cpp and would not have been possible without contributions from various sources. It remains a hobby project under active development. New feature contributions are not currently accepted, but bug fixes are welcome.