DashInfer Project Overview
DashInfer is a native runtime designed to run large language models (LLMs) with high performance and minimal resource requirements. Written primarily in C++, it is carefully optimized for multiple hardware architectures, including x86 and ARMv9.
Key Features of DashInfer
- Lightweight and Integrative: DashInfer requires minimal external dependencies and uses static linking for its libraries. It offers both C++ and Python interfaces, making it easy to integrate into existing systems.
- Accuracy and Precision: Rigorous testing shows that DashInfer delivers inference accuracy on par with PyTorch and other GPU engines.
- Efficient Inference Techniques: Continuous Batching enables rapid handling of incoming requests and streaming outputs, while asynchronous interfaces allow per-request control (a scheduling sketch follows this list).
- Broad Compatibility: Supports popular open-source LLMs such as Qwen, LLaMA, and ChatGLM, and accepts models in the Hugging Face format.
- Post-Training Quantization (PTQ): With its InstantQuant feature, DashInfer performs weight-only quantization without requiring fine-tuning, streamlining deployment while preserving model accuracy. 8-bit quantization is currently supported on ARM CPUs (a quantization sketch also follows this list).
- Optimized Performance: Leverages optimized computation libraries such as oneDNN alongside custom assembly kernels to extract the most from each hardware platform.
- Flash Attention Support: Significantly speeds up the attention computation, substantially reducing first-token latency.
- NUMA-Aware Design: Performs tensor-parallel inference across multiple NUMA nodes, making full use of server-grade CPUs while avoiding the performance penalty of cross-node memory access.
- Extended Context Lengths: Supports context lengths of up to 32k tokens, with plans to extend this further.
- Flexible API Interfaces: Provides APIs in both C++ and Python, which can be extended to other languages such as Java and Rust through standard cross-language bindings.
- OS Compatibility: Targets Linux distributions such as CentOS 7 and Ubuntu 22.04, with Docker images available for easy deployment.
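To make the Continuous Batching idea concrete, the toy Python sketch below shows the scheduling behavior: requests join and leave the running batch at per-step granularity instead of waiting for a whole batch to drain. All names here (`Request`, `continuous_batching`, `max_batch`) are illustrative inventions, not DashInfer APIs.

```python
# Toy sketch of continuous batching scheduling; not DashInfer code.
from dataclasses import dataclass, field
from collections import deque

@dataclass
class Request:
    rid: int
    remaining: int                      # tokens still to generate
    output: list = field(default_factory=list)

def continuous_batching(incoming: deque, max_batch: int = 4):
    running: list[Request] = []
    step = 0
    while incoming or running:
        # Admit new requests whenever a slot frees up (the key idea).
        while incoming and len(running) < max_batch:
            running.append(incoming.popleft())
        # One decode step produces one token for every running request.
        for req in running:
            req.output.append(f"tok{step}")
            req.remaining -= 1
        # Finished requests leave immediately, freeing slots mid-flight.
        for r in running:
            if r.remaining == 0:
                print(f"request {r.rid} finished at step {step} "
                      f"with {len(r.output)} tokens")
        running = [r for r in running if r.remaining > 0]
        step += 1

requests = deque(Request(rid=i, remaining=n) for i, n in enumerate([3, 5, 2, 4, 1]))
continuous_batching(requests)
```

The point is that a finished request frees its slot immediately, so short requests are not stuck waiting behind long ones.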
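The weight-only quantization InstantQuant performs can likewise be sketched in a few lines. The NumPy snippet below shows a generic per-channel symmetric int8 scheme; it illustrates the technique class only, and whether InstantQuant uses exactly this scheme is an assumption.

```python
import numpy as np

# Generic weight-only int8 PTQ sketch (per-output-channel, symmetric).
# The exact scheme InstantQuant implements may differ.

def quantize_weights(w: np.ndarray):
    """Quantize an [out, in] FP32 weight matrix to int8 plus FP32 scales."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0  # one scale per row
    scale = np.maximum(scale, 1e-8)                       # guard all-zero rows
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def linear_int8(x: np.ndarray, q: np.ndarray, scale: np.ndarray):
    """y = x @ w.T with weights stored as int8; activations stay FP32."""
    # Dequantize on the fly; a real kernel fuses this into the matmul.
    return x @ (q.astype(np.float32) * scale).T

rng = np.random.default_rng(0)
w = rng.standard_normal((16, 64)).astype(np.float32)
x = rng.standard_normal((2, 64)).astype(np.float32)
q, s = quantize_weights(w)
err = np.abs(linear_int8(x, q, s) - x @ w.T).max()
print(f"max abs error vs FP32 matmul: {err:.4f}")  # small, e.g. ~1e-2
```

Because only the weights are quantized and fine-tuning is skipped, deployment stays simple while the accuracy loss remains bounded by the per-channel rounding error.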
DashInfer Demonstration
A demonstration of DashInfer is available on the ModelScope platform, showcasing the Qwen1.5-7B-Chat model running on an x86-based Aliyun ECS instance.
Software Architecture
DashInfer's software architecture centers on two stages:
- Model Loading and Serialization: Converts models into a DashInfer-optimized format (.dimodel, .ditensors), enabling inference without a PyTorch dependency.
- Inference Execution: Uses the DLPack tensor format to exchange data with external frameworks. Execution is multi-threaded within a single NUMA node and uses a multi-process client-server design across multiple NUMA nodes (see the interop sketch below).
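The DLPack exchange can be demonstrated with real framework APIs. The snippet below uses NumPy and PyTorch (as a stand-in for any DLPack-capable framework; DashInfer's internal usage is not shown here) to share a tensor without copying:

```python
import numpy as np   # requires NumPy >= 1.22 for the DLPack protocol
import torch         # requires PyTorch >= 1.10 for torch.from_dlpack

# Zero-copy tensor exchange via the DLPack protocol, the same mechanism
# DashInfer's interfaces rely on for interoperability.
arr = np.arange(6, dtype=np.float32).reshape(2, 3)

# Consume the array through its __dlpack__ protocol: no data copy.
t = torch.from_dlpack(arr)

# Both views share the same memory, so writes are visible on both sides.
t[0, 0] = 42.0
print(arr[0, 0])               # 42.0

# The reverse direction works the same way (CPU tensors here).
back = np.from_dlpack(t)
print(back.shape, back.dtype)  # (2, 3) float32
```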
For detailed implementation and coding examples, DashInfer provides a comprehensive guide covering both the Python and C++ interfaces.
Performance and Compatibility
DashInfer's performance has been verified through rigorous benchmarks, confirming efficiency and accuracy that rival existing GPU-based solutions. It supports FP32 and BF16 on x86 CPUs, with additional capabilities such as InstantQuant on ARM CPUs.
Future Enhancements
Planned enhancements for DashInfer include support for 4-bit quantization, compatibility with models fine-tuned using GPTQ, and support for MoE (Mixture-of-Experts) architectures.
Licensing
DashInfer is open-source, licensed under the Apache 2.0 license, allowing for widespread use and adaptation in various projects.
In summary, DashInfer offers a high-performance, low-dependency solution for executing large language models, with a focus on compatibility, accuracy, and efficiency across different hardware and software environments.