ScaleLLM
ScaleLLM is a high-performance inference system for large language models, built for seamless production deployment. It serves leading open-source models such as Llama3.1 and GPT-NeoX, and achieves high throughput through techniques including tensor parallelism, Flash Attention, and Paged Attention. An OpenAI-compatible API lets existing clients connect without modification, and the package installs easily from PyPI, with a flexible, customizable server for a range of workloads. The project is under active development, with enhancements such as CUDA Graph, Prefix Cache, and Speculative Decoding.
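Because the server speaks the OpenAI-compatible API, any HTTP client can query it. Below is a minimal sketch using only the Python standard library; the base URL, port, and model name are placeholder assumptions, not confirmed ScaleLLM defaults:

```python
import json
import urllib.request


def build_chat_request(model, messages, temperature=0.7, max_tokens=128):
    """Build an OpenAI-style /v1/chat/completions request body."""
    return json.dumps({
        "model": model,
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }).encode("utf-8")


def chat(base_url, model, prompt):
    """POST a chat completion to an OpenAI-compatible endpoint and return the reply text."""
    body = build_chat_request(model, [{"role": "user", "content": prompt}])
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]


# Example call (requires a running server; host and port are assumptions):
# print(chat("http://localhost:8080", "meta-llama/Meta-Llama-3.1-8B-Instruct", "Hello"))
```

Since the request format follows the OpenAI specification, the same client code works against any compatible backend, which is the point of exposing that API.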