ChatLLM.cpp: A Comprehensive Overview
Introduction
ChatLLM.cpp is a project for real-time chatting with models ranging from under 1 billion to over 300 billion parameters. It runs efficiently on the CPU of an ordinary personal computer and is implemented entirely in C++ on top of the ggml library by GitHub user @ggerganov.
Noteworthy Updates
The project is continually evolving. Recent additions include support for Llama 3.2 (September 2024), Qwen 2.5, and several other models such as OLMoE and MiniCPM3, as well as tool calling and an OpenAI-compatible API.
Key Features
- Efficient CPU Inference: int4/int8 quantization and an optimized KV cache keep memory use and computation low (a quantization sketch follows this list).
- OOP Design: Object-oriented programming is used to share code across similar Transformer-based models.
- Streaming Generation: Tokens are emitted as they are generated, giving a typewriter effect.
- Continuous Chatting: Conversations can run indefinitely; when the context fills up, the "Restart" and "Shift" methods keep the chat going (see the second sketch after this list).
- Retrieval Augmented Generation (RAG): Responses can be grounded in retrieved documents during chat.
- LoRA Models: LoRA models are supported and can be merged into a base model during conversion.
- Versatile Bindings: Python, JavaScript, and C bindings enable a variety of applications and web demos.
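To make the quantization feature concrete, here is a minimal NumPy sketch of symmetric int8 quantization of a weight tensor. This is only a conceptual illustration, not the ggml implementation: the real formats (such as q8_0 used below) quantize weights in small blocks, each with its own scale, and int4 works the same way with a narrower range.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Map float weights to int8 in [-127, 127] with a single scale factor."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the int8 values and the scale."""
    return q.astype(np.float32) * scale

w = np.random.randn(8).astype(np.float32)
q, s = quantize_int8(w)
print(w)
print(dequantize_int8(q, s))  # close to w, at roughly a quarter of the memory
```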
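The "Restart" and "Shift" names come from the project; the sketch below is only an assumed illustration of the two usual ways to handle a full context window: restarting clears the history entirely, while shifting discards the oldest turns so the newest ones still fit. The window size and token bookkeeping here are hypothetical.

```python
MAX_TOKENS = 2048  # assumed context-window size, for illustration only

def restart(history):
    """Restart: clear the conversation and begin again from an empty context."""
    return []

def shift(history, new_turn_tokens):
    """Shift: drop the oldest turns until the new turn fits in the window."""
    while sum(len(t) for t in history) + len(new_turn_tokens) > MAX_TOKENS:
        history.pop(0)  # discard the oldest turn
    return history

# Five turns of 500 tokens each overflow a 2048-token window once a
# 300-token turn arrives; shifting keeps only the three newest turns.
history = [[0] * 500 for _ in range(5)]
history = shift(history, [0] * 300)
print(len(history))  # -> 3
```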
Getting Started
Getting started with ChatLLM.cpp is straightforward. A chat session can be launched with:
python chatllm.py -i -m :model_id
Comprehensive instructions are available in the project's quick start guide.
Usage Instructions
Preparation
Begin by cloning the repository using:
git clone --recursive https://github.com/foldl/chatllm.cpp.git && cd chatllm.cpp
If the --recursive option was omitted, run the following to initialize the submodules:
git submodule update --init --recursive
Model Quantization
ChatLLM.cpp supports a variety of models, which are transformed into a quantized format using convert.py. Install the tool's requirements with:
pip install -r requirements.txt
Then, models can be converted as follows:
python3 convert.py -i path/to/model -t q8_0 -o quantized.bin
If merging a LoRA model, use:
python3 convert.py -i path/to/model -l path/to/lora/model -o quantized.bin
Building the Project
ChatLLM.cpp can be built through several methods:
- Using make: On Windows, set up the environment with w64devkit, then execute:
make
- Using CMake: On Linux or Windows, execute:
cmake -B build
cmake --build build -j
Execution
To chat with a quantized model, run commands such as:
./build/bin/main -m chatglm-ggml.bin
For interactive mode, include the -i flag:
rlwrap ./build/bin/main -m model.bin -i
You can explore additional options with ./build/bin/main -h.
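If you want to drive the chat program from another application, the Python, JavaScript, and C bindings mentioned earlier are the supported route. As a rough alternative, the interactive CLI shown above can also be scripted; the snippet below is only a sketch that assumes the binary reads prompts from standard input in interactive mode and exits when the input ends, with the binary and model paths simply mirroring the commands above.

```python
import subprocess

# Rough sketch: launch the interactive CLI, feed one prompt on stdin,
# and print whatever it writes to stdout. Paths follow the steps above.
proc = subprocess.Popen(
    ["./build/bin/main", "-m", "model.bin", "-i"],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    text=True,
)
out, _ = proc.communicate("Hello, who are you?\n", timeout=300)
print(out)
```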
Acknowledgments and Note
The project is an evolution of ChatGLM.cpp and would not have been possible without contributions from various sources. It remains a hobby project under active development. New feature contributions are not currently accepted, but bug fixes are welcome.