DashInfer
DashInfer is an optimized C++ runtime for scalable, efficient inference of large language models (LLMs) across multiple hardware platforms, including x86 and ARMv9. It supports continuous batching and NUMA-aware execution for high CPU performance, and keeps third-party dependencies minimal for easy integration. It delivers accuracy on par with GPU inference and supports open-source LLMs such as Qwen and LLaMA. Techniques such as post-training quantization and flash attention further reduce latency and increase throughput in multi-node server deployments.