Easy and Efficient Transformer (EET) Project Introduction
Easy and Efficient Transformer (EET) is a user-friendly, high-performance plugin for PyTorch. It focuses on optimizing Transformer inference so that very large models can be deployed on a single GPU, making mega-sized models affordable and practical for researchers and developers working on complex natural language processing (NLP) and multi-modal tasks.
Key Features of EET
- Latest Support: EET now supports popular large language models (LLMs), including Baichuan and LLaMA.
- Int8 Quantization: Newly added 8-bit integer computation that boosts performance with minimal loss of precision.
- Single GPU Usage: Efficiently run mega-sized models using just a single GPU - a cost-effective solution for model deployment.
- High Performance: The plugin significantly accelerates transformer-based models using CUDA kernel optimization and advanced algorithms for quantization and sparsity.
- Ease of Use: Out-of-the-box compatibility with both Transformers and Fairseq libraries, reducing the complexity of model configuration.
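The int8 feature above can be illustrated with a toy example. This is not EET's actual CUDA kernel, just a minimal sketch of symmetric per-tensor int8 quantization: weights are scaled into the [-127, 127] integer range, stored compactly, and rescaled on the way back, trading a small amount of precision for 4x smaller storage.

```python
def quantize_int8(values):
    """Map floats to int8 codes plus a per-tensor scale factor."""
    # Largest magnitude maps to 127; guard against an all-zero tensor.
    scale = max(abs(v) for v in values) / 127.0 or 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale

def dequantize_int8(codes, scale):
    """Recover approximate floats from int8 codes and the scale."""
    return [c * scale for c in codes]

weights = [0.31, -1.24, 0.07, 0.995]
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)
# Rounding error per element is bounded by half the scale step.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

The per-element error never exceeds half a quantization step, which is why int8 inference typically loses little accuracy in practice.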
Model Compatibility
EET supports a broad spectrum of models, including:
- GPT-3: Offers a speed increase of up to 8x.
- BERT & ALBERT: Gain performance boosts of up to 5x.
- RoBERTa, T5, ViT, and more: Each sees improvements in efficiency and speed.
Getting Started Quickly
Basic Requirements
- CUDA 11.4 or higher
- Python 3.7 or higher
- GCC 7.4.0 or higher
- PyTorch 1.12.0 or higher
- Numpy 1.19.1 or higher
- Fairseq 0.10.0
- Transformers library 4.31.0 or higher
These are the minimum supported versions; newer releases are recommended.
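A quick way to see whether your environment meets these minimums is a small stdlib-only check; the package names and version floors below are taken from the list above, and the script reports rather than fails when a dependency is missing.

```python
import sys
from importlib.metadata import version, PackageNotFoundError

# Minimum versions from the requirements list above.
REQUIREMENTS = {
    "torch": "1.12.0",
    "numpy": "1.19.1",
    "fairseq": "0.10.0",
    "transformers": "4.31.0",
}

def parse(v):
    """Turn the leading numeric components of a version string into a tuple."""
    parts = []
    for piece in v.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)

def check_environment():
    """Report which minimum requirements the current environment meets."""
    report = {"python": sys.version_info[:2] >= (3, 7)}
    for pkg, minimum in REQUIREMENTS.items():
        try:
            report[pkg] = parse(version(pkg)) >= parse(minimum)
        except PackageNotFoundError:
            report[pkg] = False
    return report

if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'OK' if ok else 'missing or too old'}")
```

Run it before installation to spot version problems early (CUDA and GCC versions still need to be checked separately, e.g. with `nvcc --version` and `gcc --version`).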
Installation
- From Source: Clone the repository and proceed with a standard installation.
- Using Docker: Build with Docker to simplify setup and avoid manual installations.
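A source install typically looks like the following; the repository URL is assumed from the project name and may differ, so check the project page first.

```shell
# Clone the EET repository (URL assumed; verify on the project page).
git clone https://github.com/NetEase-FuXi/EET.git
cd EET
# Standard installation into the current Python environment.
pip install .
```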
Running EET
There are three primary types of APIs that EET provides:
- Operators APIs: Enable the definition of custom models by providing components like attention mechanisms and feedforward networks.
- Model APIs: Allow EET models to be integrated into PyTorch projects directly, including loading weights from Transformers and Fairseq checkpoints.
- Application APIs: Facilitate easy execution of models with minimal lines of code for specific tasks, leveraging a familiar pipeline structure.
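The Application API level can be sketched as below. The `pipeline` entry point and the `"text-generation"` task name follow the familiar pipeline structure mentioned above, but the exact names and return format are assumptions that may differ across EET versions; the function falls back to echoing the prompt when EET is not installed, so the control flow can be exercised anywhere.

```python
def generate_text(prompt: str, max_length: int = 50) -> str:
    """Run text generation through EET's pipeline-style API, if available."""
    try:
        from eet import pipeline  # Application API entry point (assumed name)
    except ImportError:
        # EET is not installed: echo the prompt so callers can still run.
        return prompt
    # Task name and output format mirror the familiar pipeline convention;
    # both are assumptions here.
    nlp = pipeline("text-generation", model="gpt2")
    return nlp(prompt, max_length=max_length)[0]["generated_text"]

if __name__ == "__main__":
    print(generate_text("Hello, EET"))
```

The point of this API level is exactly what the bullet above says: a specific task in a few lines, with no manual model assembly.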
Performance and Benefits
EET delivers significant performance improvements, reflected in benchmarks with models like GPT-3 on A100 GPUs and Bert models on 2080ti GPUs. For example, the performance chart for Llama13B on a 3090 GPU highlights substantial throughput gains.
Contributing and Further Information
For those interested in contributing to or utilizing EET in their research, the citation details are provided. For any queries, the EET team is reachable via GitHub issues or email. The project also features various examples and video demonstrations to assist with understanding and implementation.
This project is a groundbreaking step forward in making complex, large-scale AI models more accessible and practical for real-world application and experimentation.