Text Embeddings Inference: A Comprehensive Guide
Introduction
Text Embeddings Inference (TEI) is a toolkit for deploying and serving open-source text embedding and sequence classification models efficiently. It provides high-performance extraction for popular embedding models such as FlagEmbedding, Ember, GTE, and E5, making it suitable for a wide range of text-related machine learning tasks.
Notable Features
TEI offers a range of features that enhance its functionality and ease of use:
- Absence of Model Graph Compilation: Eliminates the need for complex compilation processes, streamlining deployment.
- Metal Support: Allows for local execution on Apple Macs.
- Lightweight Docker Images: Ensures quick startup times, promoting serverless operations.
- Token-Based Dynamic Batching: Facilitates efficient handling of tokenized text data.
- Optimized Code for Inference: Implements advanced technologies such as Flash Attention, Candle, and cuBLASLt to boost performance.
- Efficient Weight Loading: Utilizes Safetensors for rapid and secure model weight loading.
- Production-Ready Features: Includes distributed tracing with OpenTelemetry and Prometheus metrics for enhanced monitoring.
Supported Models
TEI supports a diverse range of models for both text embedding and sequence classification tasks. This includes BERT and RoBERTa-based models with different positional encodings, such as absolute positions for BERT and XLM-RoBERTa, Alibi positions for JinaBERT, and RoPE (rotary) positions for models like Mistral. Notable models include:
- 7B Models: Such as the Mistral- and Qwen2-based embedding models from Salesforce and Alibaba, respectively.
- Smaller Models: Such as the 0.3B BERT model from WhereIsAI and the 0.1B NomicBert for lightweight tasks.
For a comprehensive model evaluation, users can refer to the Massive Text Embedding Benchmark (MTEB) Leaderboard.
Deployment Options
TEI offers flexibility with its deployment options, including:
Docker
With simple Docker commands, users can set up and deploy their models efficiently. For example:
docker run --gpus all -p 8080:80 -v data:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.5 --model-id BAAI/bge-large-en-v1.5
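Once the container is running, the server can be queried over its HTTP API. As a quick smoke test (assuming the default /embed route and the port mapping from the command above):

curl 127.0.0.1:8080/embed \
    -X POST \
    -d '{"inputs": "What is Deep Learning?"}' \
    -H 'Content-Type: application/json'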
Local Installation
For those preferring a local setup, TEI can be installed via Rust's cargo toolchain. This option lets you run TEI directly on your machine and is especially useful for CPU-only inference.
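A minimal sketch, assuming a recent Rust toolchain and a local checkout of the TEI repository (the exact cargo feature flags depend on your hardware and TEI version):

# Build and install the router binary from the repository root
cargo install --path router -F mkl        # x86 CPUs (Intel MKL backend)
cargo install --path router -F metal      # Apple Silicon (Metal backend)

# Serve a model locally
text-embeddings-router --model-id BAAI/bge-large-en-v1.5 --port 8080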
Air-Gapped Deployment
TEI supports air-gapped deployment—ideal for environments without internet access. Users can pre-download model weights and use Docker volumes to deploy these models within a secure network.
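A sketch of the workflow, assuming the weights are fetched with git on a connected machine and then mounted into the Docker volume on the offline host:

# On a machine with internet access, download the model weights
git lfs install
git clone https://huggingface.co/BAAI/bge-large-en-v1.5

# Copy the folder into the offline host's volume, then point --model-id at the local path
docker run --gpus all -p 8080:80 -v data:/data \
    ghcr.io/huggingface/text-embeddings-inference:1.5 --model-id /data/bge-large-en-v1.5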
Usage Scenarios
TEI supports a wide array of applications, from sentiment analysis with models like SamLowe/roberta-base-go_emotions to more complex reranking tasks with models like BAAI/bge-reranker-large. It also supports sparse pooling techniques such as SPLADE pooling for advanced retrieval use cases.
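For illustration, assuming the /rerank and /predict routes from TEI's API documentation and a server started with the corresponding model:

# Reranking: score each candidate text against the query (e.g., with BAAI/bge-reranker-large)
curl 127.0.0.1:8080/rerank \
    -X POST \
    -d '{"query": "What is Deep Learning?", "texts": ["Deep learning is a subset of machine learning.", "cheese"]}' \
    -H 'Content-Type: application/json'

# Sequence classification (e.g., with SamLowe/roberta-base-go_emotions)
curl 127.0.0.1:8080/predict \
    -X POST \
    -d '{"inputs": "I like you. I love you"}' \
    -H 'Content-Type: application/json'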
Support and Documentation
Users can access in-depth API documentation via Swagger, enabling them to integrate TEI into their workflows. For setups using private or gated models, TEI provides clear instructions on using a Hugging Face API token for secure model access.
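For example, the token is passed to the container as an environment variable (a sketch; TEI's docs have used HF_API_TOKEN, and newer Hugging Face tooling also reads HF_TOKEN):

# <your-gated-model-id> is a placeholder for a private or gated model you have access to
docker run --gpus all -e HF_API_TOKEN=$token -p 8080:80 -v data:/data --pull always \
    ghcr.io/huggingface/text-embeddings-inference:1.5 --model-id <your-gated-model-id>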
TEI also offers a high-performance gRPC API as an alternative to the default HTTP API, suited to high-throughput deployments and service-to-service communication in distributed systems.
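TEI's docs publish dedicated gRPC image tags (suffixed -grpc). A minimal sketch, assuming the tei.v1.Embed/Embed method from the project's proto definition and the grpcurl tool:

# Start the gRPC variant of the server
docker run --gpus all -p 8080:80 -v data:/data --pull always \
    ghcr.io/huggingface/text-embeddings-inference:1.5-grpc --model-id BAAI/bge-large-en-v1.5

# Call the Embed method over gRPC
grpcurl -d '{"inputs": "What is Deep Learning?"}' -plaintext 0.0.0.0:8080 tei.v1.Embed/Embed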
Conclusion
Text Embeddings Inference is a versatile and powerful toolkit for deploying text embedding and sequence classification models. Its broad model support, combined with flexible deployment options, makes it a strong candidate for a variety of NLP tasks, meeting both computational and operational needs efficiently.