Introduction to Text Generation Inference (TGI)
Text Generation Inference (TGI) is a toolkit for deploying and serving Large Language Models (LLMs) for high-performance text generation. TGI is used in production at Hugging Face, where it powers Hugging Chat, the Inference API, and Inference Endpoints. It supports many popular open-source LLMs, including Llama, Falcon, StarCoder, BLOOM, and GPT-NeoX, and provides an efficient, end-to-end solution for a wide range of text generation tasks.
Key Features
- Simple Deployment: TGI offers a user-friendly launcher to deploy and serve popular LLMs effortlessly.
- Production-Ready: Equipped with features like distributed tracing with OpenTelemetry and Prometheus metrics, TGI ensures readiness for robust production environments.
- Tensor Parallelism: Accelerates inference by sharding a model across multiple GPUs.
- Token Streaming: Streams generated tokens to clients as they are produced, using Server-Sent Events (SSE).
- Batch Processing: Continuously batches incoming requests to increase total throughput.
- API Compatibility: Offers a Messages API compatible with the OpenAI Chat Completions API, so existing OpenAI clients can be reused (see the sketch after this list).
- Optimized Code: Incorporates optimized transformer code using Flash Attention and Paged Attention to enhance performance.
- Quantization: Supports quantization techniques such as bitsandbytes, GPTQ, EETQ, AWQ, Marlin, and FP8 to reduce memory usage.
- Security and Flexibility: Provides watermarking of generated text and logits warpers (temperature scaling, top-p, top-k, repetition penalty) for traceable output and customizable generation.
- Fine-tuning and Custom Prompts: Serves models fine-tuned for specific tasks and supports custom prompts to guide model output.
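As an illustration of the Messages API and token streaming, the sketch below points the official OpenAI Python client at a TGI endpoint and prints tokens as they arrive. It assumes a TGI server is already running at http://localhost:8080; the URL, the dummy API key, and the prompt are placeholders rather than fixed TGI values.

```python
# Minimal sketch: calling TGI's OpenAI-compatible Messages API with streaming.
# Assumes a TGI server is already listening on http://localhost:8080 (adjust to your setup).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # TGI exposes the Messages API under /v1
    api_key="-",                          # placeholder; TGI does not require a real key
)

# stream=True delivers tokens incrementally over Server-Sent Events (SSE).
stream = client.chat.completions.create(
    model="tgi",  # TGI serves whichever model it was launched with
    messages=[{"role": "user", "content": "Explain continuous batching in one sentence."}],
    max_tokens=128,
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```

Because the route mirrors the OpenAI schema, tooling built around that client can be pointed at a TGI deployment by changing little more than the base URL.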
Hardware Support
TGI is compatible with diverse hardware, ensuring flexibility and scalability. It supports NVIDIA and AMD GPUs, Intel GPUs, Intel Gaudi, Google TPUs, and AWS Inferentia, making it adaptable to a range of hardware ecosystems.
Getting Started
Using Docker
To start using TGI, you can deploy it with a Docker container. Here is a brief overview of the process:
- Set up a Docker environment.
- Run the Docker command to initialize the container with your chosen model.
- Make requests to the server for text generation tasks, as shown in the sketch below.
For NVIDIA GPU support, ensure you have the NVIDIA Container Toolkit installed and use CUDA version 12.2 or higher.
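Once the container is running, the server can be queried from any HTTP client. The following is a minimal sketch using Python's requests library; it assumes the container is listening on http://localhost:8080, and the prompt and generation parameters are only examples.

```python
# Minimal sketch: sending a generation request to a running TGI container.
# Assumes the server is reachable at http://localhost:8080 (adjust host/port as needed).
import requests

response = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "What is Deep Learning?",
        "parameters": {"max_new_tokens": 64, "temperature": 0.7},
    },
    timeout=60,
)
response.raise_for_status()

# The /generate route returns a JSON object containing the completed text.
print(response.json()["generated_text"])
```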
Local Installation
For those who prefer a local installation, TGI can also be set up from source; this requires Python and Rust toolchains along with dependencies such as Protoc and the OpenSSL libraries.
Optimized Architectures
TGI provides optimized implementations for many modern model architectures; less common architectures can still be served on a best-effort basis through automatic model adaptation.
Develop and Test
The TGI repository includes tooling for development and testing, so users can run and test both the Python and Rust components locally.
Quantization
TGI supports both pre-quantized weights and on-the-fly quantization to reduce VRAM usage, including 4-bit quantization with the NF4 and FP4 data types via bitsandbytes.
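As a sketch of how on-the-fly 4-bit quantization might be enabled, the snippet below starts the server through the text-generation-launcher CLI with a bitsandbytes NF4 setting. The model ID is only an example, and the exact flag values should be checked against the launcher's --help output for your TGI version.

```python
# Minimal sketch: starting TGI with on-the-fly 4-bit (NF4) quantization.
# Assumes a local TGI install with text-generation-launcher on PATH; the model ID is an example,
# and the --quantize value follows the launcher's bitsandbytes options.
import subprocess

subprocess.run(
    [
        "text-generation-launcher",
        "--model-id", "HuggingFaceH4/zephyr-7b-beta",  # example model, swap in your own
        "--quantize", "bitsandbytes-nf4",              # 4-bit NF4; "bitsandbytes-fp4" selects FP4
        "--port", "8080",
    ],
    check=True,
)
```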
The Text Generation Inference project thus represents a powerful and versatile solution for deploying large language models across various environments, providing both efficiency and flexibility in generating human-like text outputs.