DashInfer Project Overview
DashInfer is a native runtime designed to run large language models (LLMs) with high performance and minimal resource requirements. Written primarily in C++, it is carefully optimized for multiple hardware architectures, including x86 and ARMv9.
Key Features of DashInfer
- Lightweight and Integrative: DashInfer requires minimal external dependencies and uses static linking for its libraries. It offers both C++ and Python interfaces, making it easy to integrate into existing systems.
- Accuracy and Precision: Rigorous testing shows that DashInfer delivers inference accuracy on par with PyTorch and other GPU engines.
- Efficient Inference Techniques: Continuous Batching enables rapid handling of incoming requests and streaming outputs, while asynchronous interfaces allow per-request control (a scheduling sketch follows this list).
- Broad Compatibility: Supports popular open-source LLMs such as Qwen, LLaMA, and ChatGLM, and accepts models in the Hugging Face format.
- Post-Training Quantization (PTQ): With its InstantQuant feature, DashInfer performs weight-only quantization without requiring fine-tuning, streamlining deployment while preserving model accuracy. 8-bit quantization is currently supported on ARM CPUs (a quantization sketch also follows this list).
- Optimized Performance: Leverages optimized computation libraries such as oneDNN alongside custom assembly kernels to extract the most from each hardware platform.
- Flash Attention Support: Significantly speeds up the attention computation, substantially reducing first-token latency.
- NUMA-Aware Design: Performs tensor-parallel inference across multiple NUMA nodes, making full use of server-grade CPUs while avoiding the performance penalty of cross-node memory access.
- Extended Context Lengths: Supports context lengths of up to 32k tokens, with plans to extend this further.
- Flexible API Interfaces: Provides APIs in both C++ and Python, which can be extended to other languages such as Java and Rust through standard cross-language bindings.
- OS Compatibility: Targets Linux distributions such as CentOS 7 and Ubuntu 22.04, with Docker images available for easy deployment.
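To make the Continuous Batching idea concrete, the toy Python sketch below shows the scheduling behavior: requests join and leave the running batch at per-step granularity instead of waiting for a whole batch to drain. All names here (`Request`, `continuous_batching`, `max_batch`) are illustrative inventions, not DashInfer APIs.

```python
# Toy sketch of continuous batching scheduling; not DashInfer code.
from dataclasses import dataclass, field
from collections import deque

@dataclass
class Request:
    rid: int
    remaining: int                      # tokens still to generate
    output: list = field(default_factory=list)

def continuous_batching(incoming: deque, max_batch: int = 4):
    running: list[Request] = []
    step = 0
    while incoming or running:
        # Admit new requests whenever a slot frees up (the key idea).
        while incoming and len(running) < max_batch:
            running.append(incoming.popleft())
        # One decode step produces one token for every running request.
        for req in running:
            req.output.append(f"tok{step}")
            req.remaining -= 1
        # Finished requests leave immediately, freeing slots mid-flight.
        for r in running:
            if r.remaining == 0:
                print(f"request {r.rid} finished at step {step} "
                      f"with {len(r.output)} tokens")
        running = [r for r in running if r.remaining > 0]
        step += 1

requests = deque(Request(rid=i, remaining=n) for i, n in enumerate([3, 5, 2, 4, 1]))
continuous_batching(requests)
```

The point is that a finished request frees its slot immediately, so short requests are not stuck waiting behind long ones.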
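The weight-only quantization InstantQuant performs can likewise be sketched in a few lines. The NumPy snippet below shows a generic per-channel symmetric int8 scheme; it illustrates the technique class only, and whether InstantQuant uses exactly this scheme is an assumption.

```python
import numpy as np

# Generic weight-only int8 PTQ sketch (per-output-channel, symmetric).
# The exact scheme InstantQuant implements may differ.

def quantize_weights(w: np.ndarray):
    """Quantize an [out, in] FP32 weight matrix to int8 plus FP32 scales."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0  # one scale per row
    scale = np.maximum(scale, 1e-8)                       # guard all-zero rows
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def linear_int8(x: np.ndarray, q: np.ndarray, scale: np.ndarray):
    """y = x @ w.T with weights stored as int8; activations stay FP32."""
    # Dequantize on the fly; a real kernel fuses this into the matmul.
    return x @ (q.astype(np.float32) * scale).T

rng = np.random.default_rng(0)
w = rng.standard_normal((16, 64)).astype(np.float32)
x = rng.standard_normal((2, 64)).astype(np.float32)
q, s = quantize_weights(w)
err = np.abs(linear_int8(x, q, s) - x @ w.T).max()
print(f"max abs error vs FP32 matmul: {err:.4f}")  # small, e.g. ~1e-2
```

Because only the weights are quantized and fine-tuning is skipped, deployment stays simple while the accuracy loss remains bounded by the per-channel rounding error.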
DashInfer Demonstration
A demonstration of DashInfer is available on the ModelScope platform, showcasing the Qwen1.5-7B-Chat model running on an x86-based Aliyun ECS instance.
Software Architecture
DashInfer's software architecture centers on two stages:
- Model Loading and Serialization: Converts models into a DashInfer-optimized format (.dimodel, .ditensors), enabling inference without a PyTorch dependency.
- Inference Execution: Uses the DLPack tensor format to exchange data with external frameworks. Execution is multi-threaded within a single NUMA node and uses a multi-process client-server design across multiple NUMA nodes (see the interop sketch below).
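The DLPack exchange can be demonstrated with real framework APIs. The snippet below uses NumPy and PyTorch (as a stand-in for any DLPack-capable framework; DashInfer's internal usage is not shown here) to share a tensor without copying:

```python
import numpy as np   # requires NumPy >= 1.22 for the DLPack protocol
import torch         # requires PyTorch >= 1.10 for torch.from_dlpack

# Zero-copy tensor exchange via the DLPack protocol, the same mechanism
# DashInfer's interfaces rely on for interoperability.
arr = np.arange(6, dtype=np.float32).reshape(2, 3)

# Consume the array through its __dlpack__ protocol: no data copy.
t = torch.from_dlpack(arr)

# Both views share the same memory, so writes are visible on both sides.
t[0, 0] = 42.0
print(arr[0, 0])               # 42.0

# The reverse direction works the same way (CPU tensors here).
back = np.from_dlpack(t)
print(back.shape, back.dtype)  # (2, 3) float32
```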
For detailed implementation and coding examples, DashInfer provides a comprehensive guide covering both the Python and C++ interfaces.
Performance and Compatibility
DashInfer's performance has been verified through rigorous benchmarks, confirming efficiency and accuracy that rival existing GPU-based solutions. It supports FP32 and BF16 on x86 CPUs, with additional capabilities such as InstantQuant on ARM CPUs.
Future Enhancements
Planned enhancements for DashInfer include support for 4-bit quantization, compatibility with models fine-tuned using GPTQ, and support for MoE (Mixture-of-Experts) architectures.
Licensing
DashInfer is open-source, licensed under the Apache 2.0 license, allowing for widespread use and adaptation in various projects.
In summary, DashInfer offers a high-performance, low-dependency solution for executing large language models, with a focus on compatibility, accuracy, and efficiency across different hardware and software environments.