OmniQuant
OmniQuant is a quantization technique for large language models that supports both weight-only quantization (e.g. W4A16, W3A16, W2A16) and weight-activation quantization (e.g. W6A6, W4A4) while preserving accuracy. Pre-trained models such as LLaMA and Falcon are available in the OmniQuant model zoo and can be used to produce quantized weights. The release also covers the related algorithms PrefixQuant, which improves static activation quantization, and EfficientQAT, which targets time- and memory-efficient quantization-aware training. By compressing weights, OmniQuant reduces memory requirements and enables efficient inference on GPUs and mobile devices, for example running LLaMA-2-Chat with W3A16g128 quantization. Detailed resources and scripts are provided for reproducing the quantization process under specific computational settings; a sketch of the quantization notation follows below.
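To make the configuration names concrete, the following is a minimal illustrative sketch (not OmniQuant's actual implementation) of what a "WxA16 gN" setting means: weights are quantized to x bits in groups of N values along the input dimension while activations stay in 16-bit. The function name and shapes here are hypothetical, chosen only for the example.

```python
import torch

def fake_quantize_weight(w: torch.Tensor, n_bits: int = 3, group_size: int = 128) -> torch.Tensor:
    """Simulate asymmetric per-group weight quantization (e.g. W3A16g128)."""
    out_features, in_features = w.shape
    assert in_features % group_size == 0, "in_features must be divisible by group_size"
    grouped = w.reshape(out_features, in_features // group_size, group_size)

    # Per-group min/max define the quantization scale and zero-point.
    w_min = grouped.amin(dim=-1, keepdim=True)
    w_max = grouped.amax(dim=-1, keepdim=True)
    qmax = 2 ** n_bits - 1
    scale = (w_max - w_min).clamp(min=1e-8) / qmax
    zero_point = torch.round(-w_min / scale)

    # Quantize to integers, then dequantize back to float ("fake" quantization).
    q = torch.clamp(torch.round(grouped / scale) + zero_point, 0, qmax)
    dq = (q - zero_point) * scale
    return dq.reshape(out_features, in_features)

# Example: quantize a random linear-layer weight as W3A16 with group size 128.
w = torch.randn(4096, 4096)
w_q = fake_quantize_weight(w, n_bits=3, group_size=128)
print((w - w_q).abs().mean())  # average per-element quantization error
```

OmniQuant itself learns additional parameters (weight clipping and equivalent transformations) on top of this basic scheme to keep accuracy high at low bit-widths; the sketch only shows the group-wise round-to-nearest baseline that the notation refers to.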