PowerInfer
PowerInfer is a high-speed inference engine that runs Large Language Models on consumer-grade GPUs. It exploits activation locality, the observation that a small subset of "hot" neurons is activated consistently across inputs, through a hybrid CPU/GPU execution model: hot neurons are kept resident on the GPU, while the remaining "cold", input-dependent neurons are computed on the CPU, cutting GPU memory demand without sacrificing efficiency. PowerInfer delivers up to 11x faster generation than llama.cpp, producing an average of 13.20 tokens per second with peaks of 29.08 tokens per second, approaching the performance of server-grade GPUs. The architecture combines adaptive activation predictors with neuron-aware sparse operators, integrates with existing workflows while remaining backward compatible, and supports efficient deployment of models such as Falcon-40B and Bamboo-7B.
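The core mechanism, restricting each feed-forward layer to the union of a statically profiled "hot" neuron set and a per-input set chosen by a lightweight predictor, can be sketched in a few lines. The sketch below is illustrative only: it assumes a simple ReLU feed-forward layer, and every name in it (`sparse_ffn`, `hot_ids`, `predict_active`) is hypothetical rather than part of PowerInfer's actual API.

```python
import numpy as np

# Minimal sketch of predictor-gated sparse FFN execution, the idea behind
# PowerInfer's hot/cold neuron split. All names are hypothetical; this is
# not PowerInfer's API, and real execution happens in CUDA/C++ kernels.

def sparse_ffn(x, W_up, W_down, hot_ids, predict_active):
    """Compute relu(x @ W_up) @ W_down while touching only neurons that
    are statically hot (GPU-resident in PowerInfer) or predicted active
    for this input (CPU-computed in PowerInfer)."""
    # Union of the profiled hot set and the per-input predicted set.
    active = np.union1d(hot_ids, predict_active(x))
    # Dense math restricted to active columns/rows: the "sparse operator".
    h = np.maximum(x @ W_up[:, active], 0.0)   # ReLU over active neurons only
    return h @ W_down[active, :]

# Toy usage: 8-dim input, 32 hidden neurons, 8-dim output.
rng = np.random.default_rng(0)
x = rng.standard_normal(8)
W_up = rng.standard_normal((8, 32))
W_down = rng.standard_normal((32, 8))
hot_ids = np.arange(4)                              # pretend-profiled hot set
predictor = lambda v: np.where(v @ W_up > 0.5)[0]   # stand-in activation predictor
y = sparse_ffn(x, W_up, W_down, hot_ids, predictor)
```

In the real system the hot rows live in GPU memory while cold rows are handled by CPU sparse kernels, and the activation predictor is a small learned model rather than the threshold stand-in used above; the ReLU-style sparsity is what makes skipping inactive neurons safe.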