Text Embeddings Inference: A Comprehensive Guide
Introduction
Text Embeddings Inference (TEI) is a toolkit for deploying and serving open-source text embedding and sequence classification models efficiently. It provides high-performance extraction for popular embedding models such as FlagEmbedding, Ember, GTE, and E5, making it suitable for a wide range of text-related machine learning tasks.
Notable Features
TEI offers a range of features that enhance its functionality and ease of use:
- Absence of Model Graph Compilation: Eliminates the need for complex compilation processes, streamlining deployment.
- Metal Support: Allows for local execution on Apple Macs.
- Lightweight Docker Images: Ensures quick startup times, promoting serverless operations.
- Token-Based Dynamic Batching: Facilitates efficient handling of tokenized text data.
- Optimized Code for Inference: Implements advanced technologies such as Flash Attention, Candle, and cuBLASLt to boost performance.
- Efficient Weight Loading: Utilizes Safetensors for rapid and secure model weight loading.
- Production-Ready Features: Includes distributed tracing with OpenTelemetry and Prometheus metrics for enhanced monitoring.
Supported Models
TEI supports a diverse range of models for both text embedding and sequence classification tasks. This includes BERT and RoBERTa-based models with different positional encodings, such as absolute positions for BERT and XLM-RoBERTa, Alibi positions for JinaBERT, and RoPE (rotary) positions for models like Mistral. Notable models include:
- 7B Models: Such as the Mistral- and Qwen2-based embedding models from Salesforce and Alibaba, respectively.
- Smaller Models: Such as the 0.3B BERT model from WhereIsAI and the 0.1B NomicBert for lightweight tasks.
For a comprehensive model evaluation, users can refer to the Massive Text Embedding Benchmark (MTEB) Leaderboard.
Deployment Options
TEI offers flexibility with its deployment options, including:
Docker
With simple Docker commands, users can set up and deploy their models efficiently. For example:
docker run --gpus all -p 8080:80 -v data:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.5 --model-id BAAI/bge-large-en-v1.5
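Once the container is running, the server can be queried over its HTTP API. As a quick smoke test (assuming the default /embed route and the port mapping from the command above):

curl 127.0.0.1:8080/embed \
    -X POST \
    -d '{"inputs": "What is Deep Learning?"}' \
    -H 'Content-Type: application/json'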
Local Installation
For those preferring a local setup, TEI can be installed via Rust's cargo toolchain. This option lets you run TEI directly on your machine and is especially useful for CPU-only inference.
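A minimal sketch, assuming a recent Rust toolchain and a local checkout of the TEI repository (the exact cargo feature flags depend on your hardware and TEI version):

# Build and install the router binary from the repository root
cargo install --path router -F mkl        # x86 CPUs (Intel MKL backend)
cargo install --path router -F metal      # Apple Silicon (Metal backend)

# Serve a model locally
text-embeddings-router --model-id BAAI/bge-large-en-v1.5 --port 8080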
Air-Gapped Deployment
TEI supports air-gapped deployment—ideal for environments without internet access. Users can pre-download model weights and use Docker volumes to deploy these models within a secure network.
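A sketch of the workflow, assuming the weights are fetched with git on a connected machine and then mounted into the Docker volume on the offline host:

# On a machine with internet access, download the model weights
git lfs install
git clone https://huggingface.co/BAAI/bge-large-en-v1.5

# Copy the folder into the offline host's volume, then point --model-id at the local path
docker run --gpus all -p 8080:80 -v data:/data \
    ghcr.io/huggingface/text-embeddings-inference:1.5 --model-id /data/bge-large-en-v1.5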
Usage Scenarios
TEI supports a wide array of applications, from sentiment analysis with models like SamLowe/roberta-base-go_emotions to more complex reranking tasks with models like BAAI/bge-reranker-large. It also supports sparse pooling techniques such as SPLADE pooling for advanced retrieval use cases.
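For illustration, assuming the /rerank and /predict routes from TEI's API documentation and a server started with the corresponding model:

# Reranking: score each candidate text against the query (e.g., with BAAI/bge-reranker-large)
curl 127.0.0.1:8080/rerank \
    -X POST \
    -d '{"query": "What is Deep Learning?", "texts": ["Deep learning is a subset of machine learning.", "cheese"]}' \
    -H 'Content-Type: application/json'

# Sequence classification (e.g., with SamLowe/roberta-base-go_emotions)
curl 127.0.0.1:8080/predict \
    -X POST \
    -d '{"inputs": "I like you. I love you"}' \
    -H 'Content-Type: application/json'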
Support and Documentation
Users can access in-depth API documentation via Swagger, enabling them to integrate TEI into their workflows. For setups using private or gated models, TEI provides clear instructions on using a Hugging Face API token for secure model access.
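For example, the token is passed to the container as an environment variable (a sketch; TEI's docs have used HF_API_TOKEN, and newer Hugging Face tooling also reads HF_TOKEN):

# <your-gated-model-id> is a placeholder for a private or gated model you have access to
docker run --gpus all -e HF_API_TOKEN=$token -p 8080:80 -v data:/data --pull always \
    ghcr.io/huggingface/text-embeddings-inference:1.5 --model-id <your-gated-model-id>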
TEI also offers a high-performance gRPC API as an alternative to the default HTTP API, suited to high-throughput deployments and service-to-service communication in distributed systems.
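TEI's docs publish dedicated gRPC image tags (suffixed -grpc). A minimal sketch, assuming the tei.v1.Embed/Embed method from the project's proto definition and the grpcurl tool:

# Start the gRPC variant of the server
docker run --gpus all -p 8080:80 -v data:/data --pull always \
    ghcr.io/huggingface/text-embeddings-inference:1.5-grpc --model-id BAAI/bge-large-en-v1.5

# Call the Embed method over gRPC
grpcurl -d '{"inputs": "What is Deep Learning?"}' -plaintext 0.0.0.0:8080 tei.v1.Embed/Embed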
Conclusion
Text Embeddings Inference is a versatile and powerful toolkit for deploying text embedding and sequence classification models. Its broad model support, combined with flexible deployment options, makes it a strong candidate for a variety of NLP tasks, meeting both computational and operational needs efficiently.