rtp-llm
Created by Alibaba's Foundation Model Inference Team, rtp-llm is a high-performance inference engine that accelerates large language models across Alibaba platforms such as Taobao and Tmall. It pairs optimized CUDA kernels with broad hardware support, including AMD ROCm and Intel CPUs, and loads HuggingFace models directly. The engine supports multi-machine, multi-GPU parallelism and adds features such as contextual prefix caching and speculative decoding to improve serving efficiency; its primary deployment target is Linux with NVIDIA GPUs. Its reliability is proven through broad use in Alibaba's production AI projects.
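Contextual prefix caching lets an engine reuse precomputed KV-cache state for requests that share a prompt prefix (for example, a common system prompt), skipping that portion of prefill. Below is a minimal conceptual sketch of the lookup logic only, not rtp-llm's actual implementation; the `PrefixCache` class, token IDs, and the string standing in for KV state are all illustrative:

```python
class PrefixCache:
    """Toy prefix cache: maps token-ID prefixes to precomputed state.

    A real engine stores per-layer KV tensors keyed by token blocks;
    here a plain string stands in for that state so the matching
    logic stays visible.
    """

    def __init__(self):
        self._cache = {}  # tuple(token_ids) -> cached state

    def put(self, token_ids, state):
        self._cache[tuple(token_ids)] = state

    def longest_prefix(self, token_ids):
        # Walk from the longest candidate prefix down to the shortest;
        # the first hit is the longest reusable prefix.
        for end in range(len(token_ids), 0, -1):
            key = tuple(token_ids[:end])
            if key in self._cache:
                return key, self._cache[key]
        return (), None


cache = PrefixCache()
system_prompt = [101, 7592, 2088]       # shared system-prompt tokens (illustrative)
cache.put(system_prompt, "kv-state-for-system-prompt")

request = system_prompt + [2023, 2003]  # new request reusing the prefix
matched, state = cache.longest_prefix(request)
print(len(matched))  # number of prompt tokens whose prefill can be skipped
```

Only the suffix of the request (here, two tokens) would need fresh prefill computation; production engines make the lookup cheaper with block-level or trie-based indexing rather than this linear scan.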