airllm
AirLLM optimizes inference memory usage so that very large language models can run on modest hardware: 70B models on a single 4GB GPU and up to 405B models on 8GB of VRAM. Rather than relying on quantization, distillation, or pruning, it splits the model and runs it layer by layer, loading each layer from disk on demand so only a small slice of the weights occupies GPU memory at any time. Recent updates add support for Llama 3, CPU inference, and compatibility with ChatGLM and Qwen models.
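For illustration, a minimal inference sketch following the usage pattern from the AirLLM README. The Hugging Face model ID is just an example, and the `AutoModel` entry point reflects recent airllm versions (older releases used model-specific classes such as `AirLLMLlama2`); exact arguments may differ by version.

```python
from airllm import AutoModel

MAX_LENGTH = 128

# AutoModel selects the appropriate layered-inference wrapper for the
# checkpoint; the model ID below is an example, substitute any supported model.
model = AutoModel.from_pretrained("garage-bAInd/Platypus2-70B-instruct")

input_text = ["What is the capital of the United States?"]

# Tokenize with the tokenizer bundled on the model object.
input_tokens = model.tokenizer(
    input_text,
    return_tensors="pt",
    truncation=True,
    max_length=MAX_LENGTH,
)

# During generation, transformer layers are streamed from disk one at a
# time, so peak GPU memory stays within a few GB even for a 70B model.
generation_output = model.generate(
    input_tokens["input_ids"].cuda(),
    max_new_tokens=20,
    use_cache=True,
    return_dict_in_generate=True,
)

print(model.tokenizer.decode(generation_output.sequences[0]))
```

Recent versions also accept an optional `compression` argument to `from_pretrained` (e.g. `compression='4bit'`) that enables block-wise weight compression to speed up the per-layer loading; this is separate from the core layer-by-layer mechanism, which works without it.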