#quantization
curated-transformers
Curated Transformers is a PyTorch library of modular transformer models built for component reuse and easy extension. It supports large language models such as Falcon, Llama, and Dolly v2, works natively within the PyTorch ecosystem, and keeps dependencies minimal. With complete type annotations, it suits educational use and integrates well with type-checked codebases. Used by Explosion and set as the default in spaCy 3.7, it can load diverse architectures such as BERT and GPT variants from the Hugging Face Hub.
rwkv.cpp
The project ports the RWKV language model architecture to ggml, supporting FP32, FP16, and quantized inference in INT4, INT5, and INT8 formats. Primarily CPU-focused, it includes both a C library and a Python wrapper, with optional cuBLAS support. It supports RWKV versions 5 and 6, which offer a competitive alternative to Transformer models, especially for long contexts, accommodates LoRA checkpoint integration, and publishes detailed performance measurements for reference.
llama3.java
Explore Llama 3 inference in a single Java file, with features such as GGUF parsing and use of Java's Vector API for better performance. The project supports Llama 3.1 and 3.2 with an optimized tokenizer and the Q8_0 and Q4_0 quantization formats, and serves as a testbed for advanced compiler work on the JVM. The straightforward setup and native-image support enable rapid execution and varied CLI functionality.
CTranslate2
CTranslate2 facilitates efficient Transformer model inference with techniques like weight quantization and layer fusion, optimizing memory usage on both CPU and GPU. Supporting encoder-decoder, decoder-only, and encoder-only model types, it integrates seamlessly with multiple frameworks. Features include fast execution, reduced precision support, and minimal storage needs. The library's capabilities, such as automatic CPU selection, parallel processing, and ease of use in Python and C++, make it a reliable option for production use, distinctly outperforming general-purpose deep learning frameworks.
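As a hedged sketch of typical usage (the Helsinki-NLP model and directory names are arbitrary examples, not taken from the project description), int8 weight quantization can be applied at conversion time and the converted model loaded from Python:

```python
# Convert a Hugging Face model with int8 weight quantization first, e.g.:
#   ct2-transformers-converter --model Helsinki-NLP/opus-mt-en-de \
#       --output_dir opus-mt-en-de-ct2 --quantization int8
import ctranslate2
import transformers

translator = ctranslate2.Translator("opus-mt-en-de-ct2", device="cpu")
tokenizer = transformers.AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")

# CTranslate2 consumes token strings rather than raw text or ids.
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode("Hello world"))
results = translator.translate_batch([tokens])
print(tokenizer.decode(tokenizer.convert_tokens_to_ids(results[0].hypotheses[0])))
```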
fsdp_qlora
The project focuses on efficient large language model training using quantized LoRA (QLoRA) with FSDP and is supported by platforms like Axolotl. Installation targets CUDA versions up to 12.1, emphasizing low memory usage and mixed precision. Several training modes are available, suitable for models as large as Llama-2 70B. The project is in alpha, and its training recipes are still being refined.
optimum-intel
Discover how Intel's tools enhance AI model performance through integration with Hugging Face libraries. Supports Intel Extension for PyTorch for improved efficiency, Intel Neural Compressor for model compression, and OpenVINO for robust inference. Apply quantization and pruning techniques to optimize workflows on Intel hardware, with dynamic installation options and comprehensive usage examples.
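A minimal sketch of the OpenVINO path, assuming `pip install "optimum[openvino]"`; the GPT-2 model id is an arbitrary example:

```python
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "gpt2"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)
# export=True converts the PyTorch checkpoint to OpenVINO IR on the fly.
model = OVModelForCausalLM.from_pretrained(model_id, export=True)

inputs = tokenizer("Quantization reduces", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
```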
awesome-compression
Explore beginner-level insights into model compression with guidance from MIT's TinyML courses. The project offers detailed explanations of pruning, quantization, and knowledge distillation techniques, aiming to decrease the resource usage of large language models. Catering to deep learning researchers, AI developers, and students, it includes theoretical insights and practical code applications, ideal for mobile and embedded systems. Access extensive Chinese-language resources, refine your understanding, and engage with the AI community.
MegEngine
MegEngine is a versatile deep learning framework known for its unified approach to training and inference, prioritizing efficiency and usability. It reduces GPU memory usage significantly and is compatible with platforms such as x86, Arm, and CUDA. MegEngine is designed for effective inference with low hardware requirements, utilizing advanced features to improve performance. It supports various operating systems and offers easy installation through pip, fostering community involvement and AI innovation. Comprehensive documentation and tools aid in optimizing models on multiple platforms.
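A tiny sketch of MegEngine's NumPy/PyTorch-like API under a standard `pip install megengine` setup; the shapes and values below are arbitrary:

```python
import numpy as np
import megengine as mge
import megengine.functional as F

x = mge.tensor(np.random.randn(4, 8).astype("float32"))
w = mge.tensor(np.random.randn(8, 2).astype("float32"))

# Forward pass with functional ops, mirroring the familiar torch-style workflow.
y = F.relu(F.matmul(x, w))
print(y.shape)
```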
Chinese-Llama-2-7b
Chinese Llama 2 7B offers an open-source, fully commercially usable version of the LLaMA2 model, together with bilingual text-to-speech and text-to-vision datasets. Optimized for LLaMA-2-chat integration, it supports advanced multimodal applications. Regular updates add new model resources, quantized versions, GGML models, and deployment options such as Docker and an API. Resources are available on Hugging Face, Baidu, and Colab for both CPU and GPU users.
gguf-tools
The gguf-tools library facilitates manipulating and documenting GGUF files, which are central to the local machine learning ecosystem. Its utilities display detailed GGUF metadata, compare tensors between models, and inspect individual tensor weights. The project is under active development and aims at real-world applications, though some features remain experimental and support for several quantization formats is still missing. Discover its potential applications around llama.cpp and other machine learning projects.
chatglm.cpp
ChatGLM.cpp provides a C++ implementation for real-time interaction with models like ChatGLM-6B on diverse hardware, including NVIDIA GPUs and Apple Silicon. It features memory-efficient quantization, optimized KV caching, and supports models fine-tuned with P-Tuning v2 and LoRA on Linux, macOS, and Windows. Python bindings, web demos, and different chat modes are also available, and BLAS and GPU backends such as CUDA and Metal accelerate performance. Installation through PyPI enhances accessibility.
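A rough sketch of the Python bindings (`pip install chatglm-cpp`), assuming a GGML checkpoint already produced with the project's conversion script; the path is a placeholder and the chat() signature has varied across releases:

```python
import chatglm_cpp

pipeline = chatglm_cpp.Pipeline("./chatglm3-ggml.bin")  # placeholder model path
messages = [chatglm_cpp.ChatMessage(role="user", content="Hello!")]
reply = pipeline.chat(messages)
print(reply.content)
```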
ppq
This advanced framework facilitates neural network quantization across various hardware platforms by transforming floating-point operations into fixed-point, enhancing chip design efficiency. It offers customizable quantization processes compatible with TensorRT and OpenVINO. The 0.6.6 version introduces FP8 quantization, upgraded Python APIs, and sophisticated graph fusion, providing adaptable solutions for evolving AI applications.
KIVI
KIVI optimizes LLM memory usage and throughput with a tuning-free 2-bit quantization scheme for KV caches. The method cuts peak memory by 2.6×, enabling larger batch sizes and improving throughput by up to 3.47×. Compatible with models like Llama-2 and Mistral, KIVI maintains model quality while addressing speed and memory bottlenecks during inference. New features and examples are available on GitHub, including improved Hugging Face Transformers integration and support for Mistral models.
AutoGPTQ
Discover the advanced weight-only quantization package based on the GPTQ algorithm, with user-friendly APIs for making LLM inference more efficient. Recent updates include integration with renowned AI libraries like 🤗 Transformers and faster processing through the Marlin int4*fp16 kernel. AutoGPTQ covers a range of quantization and inference configurations, optimizing metrics such as inference speed and model perplexity, and runs on Linux and Windows with NVIDIA (CUDA) and AMD (ROCm) GPUs. Ideal for developers who want to optimize AI model deployment and manage computational costs while preserving model accuracy.
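A condensed sketch of the usual quantize-and-save flow; the OPT checkpoint and calibration text are placeholders:

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

pretrained = "facebook/opt-125m"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(pretrained)

quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)
model = AutoGPTQForCausalLM.from_pretrained(pretrained, quantize_config)

# GPTQ uses a handful of calibration samples to estimate quantization error.
examples = [tokenizer("AutoGPTQ is a weight-only quantization package.", return_tensors="pt")]
model.quantize(examples)
model.save_quantized("opt-125m-4bit-gptq")
```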
airllm
AirLLM facilitates large language model operation by minimizing hardware demands. It allows 70B models to run on 4GB GPUs and up to 405B models on 8GB GPUs through advanced model compression, without requiring quantization, distillation, or pruning. Recent updates include support for Llama3, CPU inference, and compatibility with ChatGLM and Qwen models.
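A sketch of the layered-inference interface following the project's recent `AutoModel` entry point (earlier releases used model-specific classes); the checkpoint name is a placeholder and output handling may differ slightly between versions:

```python
from airllm import AutoModel

model = AutoModel.from_pretrained("garage-bAInd/Platypus2-70B-instruct")  # placeholder
inputs = model.tokenizer(["What is the capital of France?"],
                         return_tensors="pt", return_attention_mask=False)
# Layers are loaded to the GPU one at a time, keeping VRAM usage low.
out = model.generate(inputs["input_ids"].cuda(), max_new_tokens=20,
                     use_cache=True, return_dict_in_generate=True)
print(model.tokenizer.decode(out.sequences[0]))
```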
optimum
Optimum provides optimization tools to improve model training and inference efficiency across multiple hardware platforms. Supporting backends such as ONNX Runtime, OpenVINO, and TensorFlow Lite, it offers easy integration and performance gains. Techniques such as graph optimization, post-training quantization, and quantization-aware training (QAT) can be applied for faster model execution. Optimum eases installation and deployment with dedicated packages for Intel, NVIDIA, AWS, and more, covering model export, quantization, and execution optimization on specialized hardware.
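A minimal sketch of post-training dynamic quantization through the ONNX Runtime backend; the DistilBERT model id and output directory are arbitrary examples:

```python
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

model_id = "distilbert-base-uncased-finetuned-sst-2-english"  # placeholder
# Export the PyTorch checkpoint to ONNX, then quantize the exported graph.
model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
quantizer = ORTQuantizer.from_pretrained(model)
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
quantizer.quantize(save_dir="distilbert-sst2-onnx-int8", quantization_config=qconfig)
```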
Telechat
TeleChat, a large language model developed by China Telecom's AI unit, includes the open-source TeleChat-1B, 7B, and 12B models, trained on large multilingual corpora. TeleChat-12B brings improvements in model structure and training data that raise performance in areas such as Q&A, coding, and mathematics. The models support modern deep learning techniques and perform well in reasoning, comprehension, and long-text generation across a range of uses.
brocolli
Brocolli is a discontinued tool for converting PyTorch models to Caffe and ONNX using Torch FX. Although no longer maintained, its documentation still provides detailed conversion instructions. It supports model quantization and works well with common models such as AlexNet. A QQ group is available for community discussion and user support.
bitsandbytes
Bitsandbytes library provides an efficient Python interface for CUDA functions, featuring 8-bit optimizers, matrix multiplication, and quantization for 8-bit and 4-bit operations. It extends support to multiple backends like AMD GPUs and Intel processors, improving cross-platform functionality. The recent alpha release showcases its commitment to expanding hardware compatibility, with ongoing efforts for Windows and future Apple Silicon support, inviting constructive community feedback for continual enhancement.
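A short sketch of 4-bit NF4 loading through the 🤗 Transformers integration; the Llama checkpoint name is a placeholder:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
# Weights are quantized to 4 bits as they are loaded onto the GPU.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # placeholder model
    quantization_config=bnb_config,
    device_map="auto",
)
```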
gpu_poor
The tool estimates GPU memory requirements and token throughput for large language models (LLMs) across various GPUs and CPUs. It breaks down memory usage for both training and inference, covering quantization formats such as GGML and bitsandbytes as well as frameworks like vLLM, llama.cpp, and Hugging Face Transformers. Key functions include estimating VRAM needs, token generation rate, and approximate finetuning time. The tool helps assess quantization suitability, maximum context length, and feasible batch sizes for a given GPU, offering useful insight into GPU memory optimization.
llama.onnx
Access LLaMa and RWKV models in ONNX format to enhance inference efficiency on devices with limited memory. This project bypasses the need for torch or transformers, supports memory pooling, and is compatible with FPGA/NPU/GPGPU hardware, enabling streamlined conversion to fp16 or TVM.
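Not taken from the project itself, but as a generic ONNX Runtime sketch of how such an exported graph can be inspected and run without torch or transformers (the file name is a placeholder):

```python
import onnxruntime as ort

session = ort.InferenceSession("llama_decoder.onnx",  # placeholder path
                               providers=["CPUExecutionProvider"])
# Discover the tensor names and shapes the exported decoder expects.
for inp in session.get_inputs():
    print(inp.name, inp.shape, inp.type)
for out in session.get_outputs():
    print(out.name, out.shape)
```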
model-optimization
TensorFlow Model Optimization Toolkit provides efficient solutions for machine learning model optimization with techniques like quantization and pruning. It is designed for users with different experience levels and includes stable Python APIs and Keras support. Find extensive guides, tutorials, and documentation to improve model deployment and performance. For more details, installation guides, and updates, visit tensorflow.org/model_optimization. Contributions are encouraged following the TensorFlow code of conduct.
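A small sketch of quantization-aware training with the toolkit's Keras API; the two-layer model is an arbitrary example:

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

base = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(16,)),
    tf.keras.layers.Dense(1),
])

# Wrap the model with fake-quantization nodes so it learns quantization-friendly weights.
qat_model = tfmot.quantization.keras.quantize_model(base)
qat_model.compile(optimizer="adam", loss="mse")
qat_model.summary()
```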
llm-compressor
This library integrates with Hugging Face models and optimizes deployment through quantization algorithms. Notable features include safetensors-based compressed formats and compatibility with large models via accelerate. It offers a range of compression schemes, including W8A8 quantization, mixed precision, and SparseGPT-based sparsity. Algorithms such as SmoothQuant and GPTQ can be applied to both activations and weights. Comprehensive examples and user guides show how to deploy and run models quickly with llmcompressor and vLLM for fast, efficient inference.
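A sketch of a one-shot W8A8 recipe combining SmoothQuant and GPTQ; the model id, dataset name, and module paths follow recent examples and may differ between releases:

```python
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder model
    dataset="open_platypus",                     # placeholder calibration set
    recipe=recipe,
    output_dir="tinyllama-w8a8",
    max_seq_length=512,
    num_calibration_samples=64,
)
```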
VILA
This visual language model utilizes large-scale interleaved image-text data to support video understanding and multi-image reasoning, featuring capabilities such as in-context learning and visual chain-of-thought. It supports efficient deployment with 4-bit quantization across diverse hardware, offering high performance in tasks like video reasoning and image question answering. The model is recognized on multiple leaderboards and is part of an extensive open-source ecosystem.
ao
Torchao provides effective solutions for PyTorch users to optimize inference and training through quantization and sparsity, enhancing model efficiency. It enables significant speed and memory improvements with weight and activation quantization. For training, it introduces Float8 data types and sparse training, ensuring resource efficiency. Its compatibility with PyTorch's `torch.compile()` and FSDP2 facilitates integration into existing workflows while supporting custom kernel development and experimental features. Suitable for researchers and developers looking to enhance performance while maintaining accuracy.
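A minimal sketch of weight-only int8 quantization; the toy model is arbitrary and the API names follow recent torchao releases:

```python
import torch
from torchao.quantization import quantize_, int8_weight_only

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 1024),
).eval()

# Swap Linear weights for int8 quantized tensors in place.
quantize_(model, int8_weight_only())

# torch.compile can then fuse dequantization into the matmuls.
compiled = torch.compile(model)
```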
Feedback Email: [email protected]