fastllm
Fastllm is a high-performance inference library written in pure C++ with no third-party dependencies, and it runs on platforms including ARM, x86, and NVIDIA GPUs. It can quantize Hugging Face models, serve an OpenAI-compatible API, and supports multi-GPU and CPU deployments with dynamic batching. Its front-end/back-end separation makes it easier to support new devices, and it integrates with models such as ChatGLM and LLAMA. Python bindings allow defining custom model structures, and extensive documentation keeps setup and use straightforward.
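Because the server speaks the OpenAI chat-completions schema, any OpenAI-style client can talk to it. A minimal sketch of building such a request in Python; the endpoint, port, and model name below are placeholders, not values defined by fastllm:

```python
import json
import urllib.request


def build_chat_request(model: str, user_message: str) -> dict:
    # Standard OpenAI chat-completions payload; an OpenAI-compatible
    # server is expected to accept this same schema.
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }


# Hypothetical local endpoint where a fastllm server might be listening.
url = "http://localhost:8080/v1/chat/completions"
payload = build_chat_request("chatglm3-6b", "Hello!")  # model name is illustrative
request = urllib.request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# With a running server, the response would be read with:
#     with urllib.request.urlopen(request) as resp:
#         reply = json.load(resp)
print(json.dumps(payload, indent=2))
```

The same payload works with any OpenAI-compatible client library by pointing its base URL at the local server instead of api.openai.com.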