Introduction to AirLLM
What is AirLLM?
AirLLM is a groundbreaking tool designed to optimize the way large language models (LLMs) are run, with a particular focus on minimizing inference memory usage. It allows extremely large models, such as a 70-billion-parameter model, to operate on a single 4GB GPU. This is achieved without resorting to traditional methods like quantization, distillation, or pruning, which typically reduce the model's size or complexity. Remarkably, AirLLM also supports running the 405-billion-parameter Llama 3.1 model on 8GB of VRAM.
Key Features of AirLLM
- Resource Efficiency: AirLLM enables large language models to run on lower-end hardware without compromising accuracy or performance, including models that typically require far more powerful hardware.
- Wide Compatibility: The tool operates on various systems, including MacOS, expanding its usability across different platforms.
- Model Compression: AirLLM employs model compression techniques, such as block-wise quantization, which can speed up inference by up to three times with minimal accuracy loss.
- Support for Multiple Models: Initially designed for Llama models, AirLLM has expanded its compatibility to other popular models such as ChatGLM, QWen, Baichuan, and Mistral, broadening its potential applications.
Getting Started with AirLLM
Installation
To begin using AirLLM, you can install the package via pip:
pip install airllm
Running Inference
Once installed, you can initialize an AirLLM model by specifying either the model's Hugging Face repository ID or a local model path. This process is straightforward and similar to using any regular transformer model.
Here's a brief example of how you might run inference:
from airllm import AutoModel
MAX_LENGTH = 128
# Load the model by its Hugging Face repo id
model = AutoModel.from_pretrained("garage-bAInd/Platypus2-70B-instruct")
input_text = ['What is the capital of United States?']
# Tokenize the prompt, truncating to MAX_LENGTH tokens
input_tokens = model.tokenizer(input_text, return_tensors="pt", truncation=True, max_length=MAX_LENGTH)
# Generate up to 20 new tokens on the GPU
generation_output = model.generate(input_tokens['input_ids'].cuda(), max_new_tokens=20, use_cache=True, return_dict_in_generate=True)
output = model.tokenizer.decode(generation_output.sequences[0])
print(output)
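As mentioned above, a local model path can be used in place of the Hugging Face repo id. A minimal sketch of that variant (the path below is a placeholder, not a real directory):
# Load from a local directory instead of the Hugging Face Hub (placeholder path)
model = AutoModel.from_pretrained("/path/to/local/model")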
Advanced Features
Model Compression
The model compression feature uses a method called block-wise quantization to dramatically speed up inference without a significant loss in accuracy. By installing additional packages such as bitsandbytes, you can unlock this enhanced performance:
pip install -U bitsandbytes
pip install -U airllm
When initializing your model, you can specify compression='4bit' or compression='8bit' to take advantage of this feature.
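For instance, a minimal sketch of loading the earlier example model with 4-bit compression enabled:
from airllm import AutoModel
# Block-wise 4-bit compression; requires the bitsandbytes package installed above
model = AutoModel.from_pretrained("garage-bAInd/Platypus2-70B-instruct", compression='4bit')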
Additional Configurations
AirLLM supports various configurations during the model initialization such as:
- Compression: To choose whether to use 4-bit or 8-bit quantization.
- Profiling Mode: To output time consumption data.
- Hugging Face Token: For gated model access.
- Prefetching: To overlap model loading and computation for better performance.
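As a rough sketch, these options are passed as keyword arguments when loading the model. The argument names below follow the AirLLM documentation, but treat the values as illustrative and the token string as a placeholder:
from airllm import AutoModel
model = AutoModel.from_pretrained(
    "garage-bAInd/Platypus2-70B-instruct",
    compression='4bit',    # 4-bit or 8-bit block-wise quantization
    profiling_mode=True,   # output time-consumption data for each step
    hf_token='hf_...',     # Hugging Face token for gated models (placeholder)
    prefetching=True,      # overlap model loading with computation
)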
Running on MacOS
AirLLM supports running on MacOS systems with a simple setup similar to other platforms. Ensure that the necessary libraries, such as mlx and torch, are installed. Note that this support is available only for Apple silicon.
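For example, the extra dependencies can be installed with pip:
pip install mlx torch
pip install airllm
Inference then follows essentially the same pattern as the earlier example, except that the .cuda() call should be dropped, since there is no CUDA device on Apple silicon.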
Example Notebooks
AirLLM provides detailed example notebooks demonstrating how to use different models, including ChatGLM, QWen, and others. These notebooks are available on platforms like Google Colab for easy experimentation.
Contributing and Community
AirLLM thrives on community support and welcomes contributions, ideas, and discussion. Users are encouraged to support the project through GitHub sponsorships or by other means like Patreon.
Citing AirLLM
For researchers or users wishing to cite AirLLM in their work, a BibTeX entry is provided to facilitate this.
AirLLM represents a significant advancement in making powerful language models accessible on lower-end hardware, thereby democratizing access to cutting-edge machine learning capabilities. If you find this tool useful, consider contributing or supporting the development team.