Introduction to the rtp-llm Project
Latest Updates
In June 2024, the rtp-llm team released a major update that introduces a revamped scheduling and batching framework written in C++, improved GPU memory management and allocation tracking, and a new Device backend. The team is also collaborating with hardware vendors to support additional backends, including AMD ROCm, Intel CPU, and ARM CPU, in future releases.
Overview
rtp-llm is a high-performance Large Language Model (LLM) inference acceleration engine developed by Alibaba's Foundation Model Inference Team. It plays a pivotal role within Alibaba Group, supporting LLM services across business divisions such as Taobao, Tmall, Idlefish, Cainiao, Amap, Ele.me, AE, and Lazada. As a sub-project of havenask, rtp-llm is a crucial component of Alibaba's technology stack.
Key Features
Proven Production Capabilities
- rtp-llm has been deployed in multiple production LLM applications, proving its reliability and efficiency. Notable deployments include:
  - Taobao Wenwen
  - Alibaba's international AI platform, Aidge
  - OpenSearch LLM Smart Q&A Edition
  - Long-tail query rewriting in Taobao Search
Superior Performance
- The engine leverages advanced CUDA kernels like PagedAttention, FlashAttention, and FlashDecoding.
- It supports WeightOnly INT8 and WeightOnly INT4 quantization, with INT4 weights produced via GPTQ and AWQ.
- rtp-llm applies framework-level optimizations that reduce dynamic batching overhead, with dedicated tuning for the V100 GPU.
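To make the weight-only quantization idea concrete: weights are stored as int8 with one scale per output channel and expanded back to float at matmul time. This is a minimal NumPy sketch of the technique, not rtp-llm's actual CUDA kernel:

```python
import numpy as np

def quantize_weights_int8(w: np.ndarray):
    """Symmetric per-output-channel weight-only INT8 quantization.

    Each row (output channel) of `w` gets its own scale so that the
    largest-magnitude weight in that row maps to 127.
    """
    scales = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scales = np.where(scales == 0, 1.0, scales)  # guard against all-zero rows
    q = np.clip(np.round(w / scales), -127, 127).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    # At inference time the int8 weights are expanded back to float
    # for the matmul; only the weights (not activations) are quantized.
    return q.astype(np.float32) * scales

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
q, s = quantize_weights_int8(w)
w_hat = dequantize(q, s)
max_err = np.abs(w - w_hat).max()  # bounded by half a quantization step
```

Storing weights at a quarter of FP32 size cuts memory bandwidth during decoding, which is typically the bottleneck for single-batch generation.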
Flexibility and User-Friendliness
- It integrates seamlessly with HuggingFace models, supporting weight formats including SafeTensors, PyTorch, and Megatron.
- The platform supports multimodal inputs, combining images and text, and allows for multi-machine/multi-GPU tensor parallelism.
- In addition, it offers P-tuning model support and the deployment of multiple LoRA services with a single model instance.
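Serving multiple LoRA adapters from one model instance works because each adapter is only a low-rank delta on the base weights: the engine keeps several adapters resident and each request selects one by name. This NumPy sketch illustrates the concept only; the adapter-registry shape and `lora_forward` helper are illustrative, not rtp-llm's API:

```python
import numpy as np

def lora_forward(x, W, adapters, name):
    """Apply base weight W plus the named LoRA adapter's low-rank delta.

    adapters[name] = (A, B, alpha) with A: (r, in), B: (out, r).
    One base model instance can serve many such adapters side by side.
    """
    A, B, alpha = adapters[name]
    r = A.shape[0]
    delta = (B @ A) * (alpha / r)      # rank-r update, scaled as in LoRA
    return x @ (W + delta).T

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 4)).astype(np.float32)   # base weight (out=6, in=4)
adapters = {
    "svc_a": (rng.standard_normal((2, 4)).astype(np.float32),
              rng.standard_normal((6, 2)).astype(np.float32), 4.0),
    # A zero adapter reproduces the base model exactly:
    "svc_b": (np.zeros((2, 4), np.float32),
              np.zeros((6, 2), np.float32), 4.0),
}
x = rng.standard_normal((1, 4)).astype(np.float32)
y_a = lora_forward(x, W, adapters, "svc_a")
y_b = lora_forward(x, W, adapters, "svc_b")
```

Because the base weights W are shared, the per-adapter memory cost is only the small A and B matrices, which is what makes multi-LoRA serving on a single instance cheap.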
Advanced Acceleration Techniques
- It supports the loading of pruned irregular models and features like Contextual Prefix Cache for dialogues, System Prompt Cache, and Speculative Decoding.
- The Medusa module provides sophisticated parallelization strategies.
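The prefix-cache features above rest on a simple idea: if a new request's leading tokens match a previously processed sequence (e.g. a shared system prompt), the cached attention KV blocks for that prefix can be reused and only the suffix needs prefill. A toy Python sketch of the matching step, with a plain dict standing in for real KV-block storage:

```python
def longest_cached_prefix(cache: dict, tokens: list) -> int:
    """Return the length of the longest cached prefix of `tokens`.

    `cache` maps token tuples to (stand-in) KV-cache handles; a real
    engine would reuse the attention KV blocks for the matched prefix
    and only run prefill on the remaining suffix.
    """
    for n in range(len(tokens), 0, -1):
        if tuple(tokens[:n]) in cache:
            return n
    return 0

cache = {}
system = [1, 2, 3]                       # e.g. a shared system prompt
cache[tuple(system)] = "kv-blocks-for-system-prompt"

request = [1, 2, 3, 7, 8]                # new request sharing that prompt
hit = longest_cached_prefix(cache, request)
# request[:hit] is served from cache; only request[hit:] needs prefill
```

A production implementation would use a trie or block-hash table rather than a linear scan, but the cache-hit semantics are the same.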
Getting Started
Requirements
- Operating System: Linux
- Python: 3.10
- NVIDIA GPU: Compute Capability 7.0 or higher (e.g., RTX20xx, RTX30xx, RTX40xx, V100, T4, etc.)
Starting Example
To start using rtp-llm, users can either set up a Docker container or install the package with pip. Detailed setup instructions, including Docker images and wheel (whl) packages, are provided for the various supported CUDA environments.
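The project's documentation lists the exact Docker image tags and wheel filenames for each CUDA version; the general shape of the two install paths looks like the following. All names in angle brackets are placeholders, not real artifact names:

```shell
# Option 1: Docker (substitute the image tag from the official docs)
docker pull <registry>/rtp-llm:<cuda-version-tag>
docker run --gpus all -it <registry>/rtp-llm:<cuda-version-tag> bash

# Option 2: pip install from the published wheel for your CUDA version
# (substitute the wheel filename from the official docs)
pip install rtp_llm-<version>-<cuda-tag>.whl
```

The Docker route bundles matching CUDA libraries and is the safer choice when the host driver and toolkit versions are uncertain.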
Documentation and Support
Comprehensive documentation is available, offering detailed guides on deployment, usage, multi-GPU inference, and more. The project acknowledges numerous open-source contributions and provides a road map for future development. For any inquiries or assistance, community support is available via DingTalk and WeChat groups.
Conclusion
rtp-llm stands as a pivotal tool for accelerating LLM inference at Alibaba. With its strong performance, flexibility, and cutting-edge features, it continues to power advanced AI applications, reinforcing Alibaba's position in the global technology landscape.