InferLLM
InferLLM is a streamlined LLM inference framework inspired by llama.cpp. It provides kernel optimizations and a specialized KVstorage for efficient model handling. It runs on Arm, x86, and CUDA platforms and supports int4-quantized Chinese and English models, which makes it suitable for deployment on both desktops and mobile devices. Recent updates add support for the Llama-2-7B model and improve performance on Arm. Supported models include ChatGLM, Alpaca, and Baichuan.