Infinity Project Introduction
Overview
Infinity is an advanced REST API designed for high-throughput, low-latency serving of text-embedding models, reranking models, and multi-modal models such as CLIP, CLAP, and ColPali. The project is released under the MIT License, making it open source and accessible for a wide array of applications.
Key Features
Model Deployment from HuggingFace
Infinity allows users to deploy a wide range of models from HuggingFace, a popular platform for model sharing. Users can deploy embedding, reranking, CLIP, and sentence-transformer models, giving deployments the flexibility to adapt to various needs.
Fast Inference Backends
The inference server of Infinity is optimized for performance. It leverages frameworks such as Torch, Optimum (ONNX/TensorRT), and CTranslate2, combined with FlashAttention, to make efficient use of different hardware accelerators, including NVIDIA CUDA, AMD ROCm, CPU, AWS Inferentia2 (INF2), and Apple MPS.
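As an illustration, the backend, device, and precision can be chosen when constructing an engine through the Python API. This is a minimal sketch; the engine, device, and dtype parameter names are based on the infinity_emb package's EngineArgs and should be verified against the documentation for your installed version.

# Minimal sketch: choosing a backend and accelerator via EngineArgs.
# The engine/device/dtype parameters are assumptions based on the
# infinity_emb Python API; check your version's documentation.
from infinity_emb import AsyncEngineArray, EngineArgs

array = AsyncEngineArray.from_args([
    EngineArgs(
        model_name_or_path="BAAI/bge-small-en-v1.5",
        engine="torch",   # alternatives include "optimum" (ONNX) and "ctranslate2"
        device="cuda",    # or "cpu", or "mps" on Apple hardware
        dtype="auto",     # precision selection, e.g. "fp16"
    )
])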
Multi-modal and Multi-model Orchestration
Infinity stands out by serving multiple models, of the same or different types, from a single deployment, offering a robust solution for complex data environments that must handle various types of input simultaneously.
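For example, a single engine array can host models of different types side by side. A minimal sketch through the Python API (model ids below are illustrative):

# Sketch: one engine array hosting an embedding model and a reranker.
from infinity_emb import AsyncEngineArray, EngineArgs

array = AsyncEngineArray.from_args([
    EngineArgs(model_name_or_path="BAAI/bge-small-en-v1.5"),
    EngineArgs(model_name_or_path="mixedbread-ai/mxbai-rerank-xsmall-v1"),
])
embedder, reranker = array[0], array[1]  # one engine per deployed model

The same idea carries over to the server: the v2 CLI accepts repeated --model-id flags, as shown in the Docker command later in this document.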
Tested and Reliable Implementation
Infinity's implementation is validated through rigorous unit and end-to-end testing, helping ensure that embeddings served through the API are handled accurately and that users can create embeddings efficiently and reliably.
Ease of Use
Built on FastAPI, Infinity provides a user-friendly environment for both beginners and experienced developers. Its command-line interface (CLI) and OpenAPI-documented endpoints make integration and operation straightforward, and comprehensive documentation makes it easy to get started.
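Because the server is a FastAPI application, it publishes its own OpenAPI schema, which clients can inspect directly. A minimal sketch, assuming a server running locally on port 7997 (see Getting Started below); the /models route name is an assumption and should be checked against the server's /docs page:

# Sketch: inspecting the auto-generated OpenAPI schema and deployed models.
import requests

base = "http://localhost:7997"
schema = requests.get(f"{base}/openapi.json").json()  # FastAPI's built-in OpenAPI document
print(schema["info"]["title"])
print(requests.get(f"{base}/models").json())  # lists the deployed models (verify route via /docs)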
Latest Developments
Infinity continuously evolves with the addition of new features and improvements. Recent updates include:
- Deployment via Modal and Free GPU Deployment (July 2024)
- Support for Multi-modal Tasks (June 2024)
- CLI Enhancements and API Key Integration (May 2024)
- Experimental INT8 and FP8 Support (March 2024)
- Documentation Updates and Community Engagement (February 2024)
Getting Started
Installing and Running CLI
Users can get started quickly by installing the CLI with Python's pip package manager, then launching models directly, configured via environment variables or command-line arguments.
pip install infinity-emb[all]
infinity_emb v2 --model-id BAAI/bge-small-en-v1.5
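The pip package also exposes a Python API for creating embeddings without running the HTTP server. A minimal sketch following the library's AsyncEmbeddingEngine pattern (details may vary between versions):

# Sketch: embedding through the Python API installed by pip.
import asyncio
from infinity_emb import AsyncEngineArray, EngineArgs

array = AsyncEngineArray.from_args([
    EngineArgs(model_name_or_path="BAAI/bge-small-en-v1.5", engine="torch")
])

async def embed_text():
    engine = array[0]
    async with engine:  # handles model start-up and shutdown
        embeddings, usage = await engine.embed(sentences=["Embed this sentence via Infinity."])
    return embeddings

asyncio.run(embed_text())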
Using Docker for Deployment
For those preferring containerized applications, Infinity can be run directly through a pre-built Docker container, ensuring streamlined deployment while efficiently utilizing hardware accelerators. In the command below, $port, $model1, $model2, and $volume are shell variables (for example, port=7997, model1=BAAI/bge-small-en-v1.5, and volume=$PWD/data for the model cache).
docker run -it --gpus all -v $volume:/app/.cache -p $port:$port michaelf34/infinity:latest v2 --model-id $model1 --model-id $model2 --port $port
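Once the container is up, the /embeddings route follows the OpenAI embeddings schema, so any OpenAI-compatible client can call it. A sketch using the openai Python client; the base_url and placeholder API key assume a local, unsecured server on port 7997:

# Sketch: querying the server with the OpenAI-compatible client.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:7997", api_key="sk-not-needed")  # placeholder key
response = client.embeddings.create(
    model="BAAI/bge-small-en-v1.5",  # must match one of the deployed --model-id values
    input=["A sentence to embed."],
)
print(len(response.data[0].embedding))  # dimensionality of the returned vector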
Sample Use Cases
Infinity caters to a wide range of use cases, including the following (a usage sketch follows the list):
- Text Embeddings: Transforming text into dense vectors for tasks like retrieval and semantic search.
- Reranking: Scoring documents by relevance to a given query.
- Multi-modal Processing: Handling images, audio, and text concurrently through models like CLIP and CLAP.
- Text Classification: Performing sentiment analysis and emotion detection using classification models.
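As a combined illustration of the reranking and classification use cases, here is a sketch through the Python engine API; the model ids are examples, and any compatible HuggingFace models should work:

# Sketch: reranking and text classification via the Python engine API.
import asyncio
from infinity_emb import AsyncEngineArray, EngineArgs

array = AsyncEngineArray.from_args([
    EngineArgs(model_name_or_path="mixedbread-ai/mxbai-rerank-xsmall-v1"),
    EngineArgs(model_name_or_path="SamLowe/roberta-base-go_emotions"),
])

async def rerank_and_classify():
    reranker, classifier = array[0], array[1]
    async with reranker, classifier:
        ranking, _ = await reranker.rerank(
            query="Where is Paris?",
            docs=["Paris is in France.", "Berlin is in Germany."],
        )
        predictions, _ = await classifier.classify(sentences=["I am so happy today!"])
    return ranking, predictions

asyncio.run(rerank_and_classify())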
Integrations and Community
Infinity integrates seamlessly into platforms such as LangChain and supports serverless deployments via Runpod. The project fosters community involvement through meetups and open-source collaboration, encouraging innovation and shared development.
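For instance, LangChain ships a wrapper that points at a running Infinity server. A minimal sketch assuming langchain_community's InfinityEmbeddings class, its model and infinity_api_url parameters, and a local server on port 7997; verify these against the LangChain documentation:

# Sketch: using Infinity as an embedding provider in LangChain.
# The InfinityEmbeddings wrapper and its parameters are assumptions based on
# the langchain_community integration; check its documentation.
from langchain_community.embeddings import InfinityEmbeddings

embeddings = InfinityEmbeddings(
    model="BAAI/bge-small-en-v1.5",
    infinity_api_url="http://localhost:7997",
)
vectors = embeddings.embed_documents(["Infinity integrates with LangChain."])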
Conclusion
Infinity is an adaptable and powerful API for developers and researchers looking to leverage advanced model deployment and data processing capabilities. Its support for diverse model types and its ease of use make it an invaluable tool in the growing field of AI and data analytics.