Introduction to the RKNN-LLM Project
The RKNN-LLM project provides a software stack for efficiently deploying AI models on Rockchip chips, using the RKNPU (Rockchip Neural Processing Unit) for hardware acceleration. The RKNN-LLM framework simplifies model deployment and gives developers the tools to convert, optimize, and run inference on large language models (LLMs).
Overview of the RKNN-LLM Framework
The RKNN-LLM project consists of several key components:
- RKLLM-Toolkit: A software development kit (SDK) used on a PC to convert trained AI models into the RKLLM format for deployment on Rockchip devices. The toolkit also supports model quantization, which reduces model size and speeds up inference (see the conversion sketch after this list).
- RKLLM Runtime: C/C++ programming interfaces for the Rockchip NPU platform. It enables rapid deployment of RKLLM models and accelerates LLM applications by leveraging the NPU's computational capabilities.
- RKNPU Kernel Driver: The driver that interacts directly with the NPU hardware. It is open source and integrated into the Rockchip kernel code, giving developers access to low-level performance tuning.
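As an illustration of the toolkit's conversion flow, here is a minimal sketch modeled on the example scripts that ship with RKLLM-Toolkit. The model path, output path, and parameter values are placeholder assumptions, and exact argument names can vary between toolkit versions, so check the examples bundled with your release:

```python
# Minimal model-conversion sketch with the RKLLM-Toolkit Python API.
# Paths and parameter values are placeholders; argument names may
# differ between toolkit versions, so verify against shipped examples.
from rkllm.api import RKLLM

MODEL_PATH = './Qwen-1_8B'         # placeholder: Hugging Face model directory
EXPORT_PATH = './qwen-1_8b.rkllm'  # placeholder: output file in RKLLM format

llm = RKLLM()

# Load the trained Hugging Face model from disk.
if llm.load_huggingface(model=MODEL_PATH) != 0:
    raise RuntimeError('Failed to load the source model')

# Quantize and compile the model for the target NPU platform.
if llm.build(do_quantization=True,
             quantized_dtype='w8a8',          # weight/activation quantization scheme
             target_platform='rk3588') != 0:  # or 'rk3576'
    raise RuntimeError('Model build failed')

# Export the converted model in the RKLLM deployment format.
if llm.export_rkllm(EXPORT_PATH) != 0:
    raise RuntimeError('Export failed')
```

The exported .rkllm file is then copied to the device and loaded through the RKLLM Runtime's C/C++ API.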
Supported Platforms
RKNN-LLM supports various Rockchip series, including:
- RK3588 Series
- RK3576 Series
These platforms are equipped to handle demanding AI workloads efficiently.
Supported Models
A variety of AI models are supported by the RKNN-LLM project, allowing a broad range of applications and use cases. The supported models include:
- LLAMA models
- TinyLLAMA models
- Qwen models
- Phi models
- ChatGLM3-6B
- Gemma models
- InternLM2 models
- MiniCPM models
These models cater to different performance and size needs, providing flexibility for developers.
Model Performance and Benchmarking
The RKNN-LLM project includes performance benchmarks for the supported models on the supported platforms. These benchmarks report inference efficiency in terms of time-to-first-token (TTFT), tokens generated per second, and memory usage on the different Rockchip platforms.
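For concreteness, here is a sketch of how these two latency metrics are typically computed. The generate_stream iterator is a hypothetical stand-in for any token-by-token generation loop, not part of the RKLLM API:

```python
import time

def measure(generate_stream):
    """Return (TTFT seconds, decode tokens/second) for one generation run."""
    start = time.perf_counter()
    first_token_time = None
    count = 0
    for _ in generate_stream:
        if first_token_time is None:
            # Time-to-first-token: prefill plus the first decode step.
            first_token_time = time.perf_counter()
        count += 1
    end = time.perf_counter()
    ttft = first_token_time - start
    # Decode rate excludes the prefill phase; guard against 1-token runs.
    tps = (count - 1) / (end - first_token_time) if count > 1 else 0.0
    return ttft, tps
```

TTFT is dominated by the prefill stage and grows with prompt length, while tokens per second reflects steady-state decoding throughput.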
Key Features and Updates
The recent version 1.1.0 of RKNN-LLM introduces several improvements and features:
- Group-wise quantization with customizable group sizes for more precise performance tuning.
- Support for joint inference, including the loading of LoRA models.
- Enhanced prompt caching capabilities for faster model response times.
- New support for GGUF model conversion with q4_0 and fp16 (see the GGUF sketch following this list).
- Optimizations in initialization, prefill, and decoding stages.
- Broadened input type support, accommodating prompts, embeddings, tokens, and multimodal inputs.
- Enhanced quantization algorithms, plus a fix for catastrophic forgetting when the token count exceeds the maximum context length.
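As a sketch of the GGUF conversion path mentioned above: it mirrors the Hugging Face flow, but the load_gguf entry point and the build arguments shown here are assumptions to verify against the v1.1.0 toolkit documentation:

```python
from rkllm.api import RKLLM

llm = RKLLM()

# Load a GGUF checkpoint (q4_0 or fp16) produced by llama.cpp tooling.
# The method name mirrors load_huggingface; confirm it in your toolkit version.
if llm.load_gguf(model='./model-q4_0.gguf') != 0:
    raise RuntimeError('Failed to load the GGUF model')

# A q4_0 file already carries quantized weights, so re-quantization is
# skipped here; target_platform selects the NPU to compile for.
if llm.build(do_quantization=False, target_platform='rk3588') != 0:
    raise RuntimeError('Model build failed')

if llm.export_rkllm('./model.rkllm') != 0:
    raise RuntimeError('Export failed')
```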
Downloads and Resources
The latest version of RKNN-LLM and related resources, including Docker images, examples, and documentation, can be downloaded from the RKLLM_SDK page using the fetch code rkllm.
Compatibility and Notes
- Version 1.1.0 introduces significant changes and is not compatible with models converted by older toolkit versions, so use the latest toolchain for both model conversion and inference.
- Supported Python versions are 3.8 and 3.10.
- The latest release version is v1.1.1, which can be found on GitHub.
The RKNN-LLM project provides a robust framework and toolkit for developers aiming to deploy AI models efficiently on Rockchip platforms, backed by comprehensive documentation and community resources. For deploying conventional (non-LLM) AI models on the same NPUs, the companion RKNN-Toolkit2 project offers similar functionality.