Introduction to InferLLM
InferLLM is a lightweight framework designed to simplify the process of using large language models (LLMs) for inference. Inspired by the llama.cpp project, InferLLM addresses some limitations and introduces several advantageous features to enhance usability and performance.
Key Features
- Simplified Structure: InferLLM is built with a straightforward approach, making it easy for developers to get started and learn the framework. It achieves this by decoupling the framework from its kernel components.
- Efficient Performance: The framework is optimized to be highly efficient by incorporating most of the kernel functions from llama.cpp.
- Optimized KVstorage: A specialized key-value storage type is created to efficiently handle caching and management tasks.
- Multi-Model Compatibility: InferLLM supports various model formats. Currently, it is compatible with Chinese and English int4 models such as Alpaca.
- CPU and GPU Support: The framework supports both CPUs and GPUs and has been specially optimized for different processor architectures, including Arm, x86, CUDA, and riscv-vector. It also runs efficiently on mobile devices.
Recent Updates
- 2023.08.16: Introduced support for the LLama-2-7B model.
- 2023.08.08: Enhanced performance on Arm architecture, specifically optimizing the int4 matmul kernel using ARM assembly and kernel packing.
- Previously: Added support for the chatglm/chatglm2, baichuan, alpaca, and ggml-llama models.
How to Use InferLLM
Model Downloading
InferLLM supports models from the llama.cpp project and also allows downloading from Hugging Face under kewin4933/InferLLM-Model. This includes alpaca, llama2, chatglm/chatglm2, and baichuan models available in both Chinese and English int4 formats.
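For example, a single model file can be fetched directly from that Hugging Face repository. This is only a sketch: the chatglm-q4.bin file name (used later in this README) and the resolve/main URL layout are assumptions about how the repository is organized.
# hypothetical direct download; adjust the file name to the model you need
wget https://huggingface.co/kewin4933/InferLLM-Model/resolve/main/chatglm-q4.bin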
Compiling InferLLM
Local Compilation:
For compiling locally, execute the following commands:
mkdir build
cd build
cmake ..
make
By default, GPU support is disabled. To enable it (currently only CUDA is supported), run cmake -DENABLE_GPU=ON .. instead, and ensure the CUDA toolkit is installed beforehand.
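For reference, a GPU-enabled build follows the same steps with the CMake flag swapped in; this is a minimal sketch that assumes the CUDA toolkit is already installed and discoverable by CMake:
mkdir build
cd build
# enable the CUDA backend instead of the default CPU-only build
cmake -DENABLE_GPU=ON ..
make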
Android Cross Compilation:
Utilize the pre-prepared script tools/android_build.sh. First, set up the NDK path in the NDK_ROOT environment variable. For instance:
export NDK_ROOT=/path/to/ndk
./tools/android_build.sh
Running InferLLM
To run ChatGLM models, refer to the corresponding model documentation. For local execution, run:
./chatglm -m chatglm-q4.bin -t 4
For mobile execution, transfer the executable and the model to the device with adb, then run it via adb shell:
adb shell ./chatglm -m chatglm-q4.bin -t 4
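A minimal push-and-run sketch might look like the following; the /data/local/tmp directory is an assumed writable location on the device, not one prescribed by the project:
# /data/local/tmp is an assumption; any writable, executable location works
adb push chatglm /data/local/tmp
adb push chatglm-q4.bin /data/local/tmp
adb shell "cd /data/local/tmp && ./chatglm -m chatglm-q4.bin -t 4"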
To run inference on the GPU (this requires a build with -DENABLE_GPU=ON), use:
./chatglm -m chatglm-q4.bin -g GPU
Hardware Compatibility and Profiles
- x86: Compatible with Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz.
- Android: Compatible with devices like Xiaomi9 (Qualcomm SM8150 Snapdragon 855).
- RISC-V: Compatible with SG2042 (riscv-vector 0.7), 64 threads.
The team highly recommends using 4 threads based on the profiling results.
Supported Models
InferLLM currently supports several models, including alpaca, llama2, chatglm/chatglm2, baichuan, and ggml-llama.
Licensing
InferLLM is available under the Apache License, Version 2.0, making it accessible for use and contributions within the open-source community.