Introduction to ChatGLM.cpp
ChatGLM.cpp is a robust C++ implementation designed for seamless, real-time chat experiences on various platforms, including MacBooks. It supports several models from the ChatGLM series, such as ChatGLM-6B, ChatGLM2-6B, ChatGLM3, and GLM-4.
Key Features
Core Attributes:
- Pure C++ Implementation: Built on the ggml tensor library, in the same spirit as the well-known llama.cpp, keeping the project lightweight and efficient.
- Optimized Performance: Memory-efficient CPU inference accelerated by int4/int8 quantization, an optimized key-value cache, and parallel computation.
- Model Adaptability: Supports P-Tuning v2 and LoRA finetuned models, allowing users to align the model's output more closely with specific needs.
- Dynamic Interaction: Streaming generation with a typewriter effect, so tokens appear on screen as they are produced.
- Cross-platform Interfaces: Provides Python bindings, web demos, API servers, and more, extending its usability across platforms and languages.
Support Matrix:
- Hardware: Operates smoothly on x86/arm CPUs, NVIDIA GPUs, and Apple Silicon GPUs.
- Platforms: Compatible with Linux, macOS, and Windows.
- Models: Supports ChatGLM-6B, ChatGLM2-6B, ChatGLM3, GLM-4, and CodeGeeX2.
Getting Started
To start using ChatGLM.cpp, follow the steps below.
Preparation:
- Clone the Repository: Use git to clone the repository and enter the project directory.
git clone --recursive https://github.com/li-plus/chatglm.cpp.git && cd chatglm.cpp
- Quantize the Model: First install the required Python packages.
python3 -m pip install -U pip
python3 -m pip install torch tabulate tqdm transformers accelerate sentencepiece
Then convert the original model into a quantized GGML format with convert.py for efficient inference, as shown below.
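For example, the following command (a sketch following the project's README; the Hugging Face model ID, quantization type, and output path are all adjustable) converts ChatGLM-6B to 4-bit (q4_0) GGML format:
python3 chatglm_cpp/convert.py -i THUDM/chatglm-6b -t q4_0 -o models/chatglm-ggml.bin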
Build & Run:
- Compile the project with CMake.
cmake -B build
cmake --build build -j --config Release
- Run the quantized model for a chat experience (the sample prompt 你好 means "hello"):
./build/bin/main -m models/chatglm-ggml.bin -p 你好
Advanced Uses
ChatGLM.cpp extends its capabilities into a range of advanced functionality, from an interactive chat mode to complex tasks such as code interpretation and programming with CodeGeeX2.
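For example, passing -i instead of a one-shot prompt starts interactive chat mode (a sketch assuming the model file converted above):
./build/bin/main -m models/chatglm-ggml.bin -i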
Performance Enhancement with BLAS and More
ChatGLM.cpp can offload matrix multiplications to BLAS libraries and, on NVIDIA GPUs, to CUDA, improving performance for users with compatible hardware. These backends are enabled through CMake options at build time.
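The exact CMake option names track the underlying ggml version, so treat the following as illustrative examples to verify against the current README rather than fixed flags:
# OpenBLAS-accelerated CPU inference
cmake -B build -DGGML_OPENBLAS=ON && cmake --build build -j
# cuBLAS acceleration on NVIDIA GPUs
cmake -B build -DGGML_CUBLAS=ON && cmake --build build -j
# Metal acceleration on Apple Silicon
cmake -B build -DGGML_METAL=ON && cmake --build build -j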
Python Binding
The project includes Python bindings that expose a high-level API similar to that of Hugging Face models. Users can either install prebuilt packages directly from PyPI or build from source for additional configuration and acceleration options.
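Installation from PyPI is a single command:
python3 -m pip install -U chatglm-cpp
A minimal chat sketch follows, assuming the model file converted earlier; the exact chat API has changed across releases, so treat this as illustrative and check the README for your installed version:
import chatglm_cpp

# Load a quantized GGML model produced by convert.py
pipeline = chatglm_cpp.Pipeline("models/chatglm-ggml.bin")

# Single-turn chat using the message-based API of recent releases
messages = [chatglm_cpp.ChatMessage(role="user", content="hello")]
print(pipeline.chat(messages).content)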
Conclusion
ChatGLM.cpp is an exemplary project for anyone interested in deploying advanced chat models in real time using C++. With support for a wide range of hardware, platforms, and models, it positions itself as a versatile tool for developers. Whether you are a hobbyist exploring AI models or a professional deploying enterprise applications, ChatGLM.cpp offers the flexibility and performance you need.