Introduction to Jina-Serve
Jina-Serve is a comprehensive framework designed for building and deploying AI services. It facilitates communication through gRPC, HTTP, and WebSockets, allowing developers to focus on their core logic while easily scaling services from local environments to full production deployments.
Key Features
- Wide Compatibility: Natively supports major machine learning frameworks and diverse data types.
- Optimized Performance: High-performance architecture featuring scaling, streaming, and dynamic batching for efficient service operations.
- LLM Serving: Supports streaming output for large language models.
- Integration with Containers: Comes with built-in Docker support and access to the Executor Hub.
- Easy Deployment: Offers one-click deployment to Jina AI Cloud.
- Enterprise-Ready: Compatible with Kubernetes and Docker Compose for seamless enterprise integration.
Advantages Over FastAPI
Compared with a general-purpose web framework such as FastAPI, Jina-Serve offers:
- DocArray-Based Data Handling: Ensures efficient data manipulation with native gRPC support.
- Integrated Containerization: Simplifies service orchestration, enabling easier scaling.
- Effortless Cloud Deployment: Deploy services to the cloud with a single command.
Installation
To get started with Jina-Serve, install it via pip:
pip install jina
For specific operating systems, check out the installation guides for Apple Silicon and Windows.
Core Concepts
Jina-Serve is built on three foundational layers:
- Data Layer: Utilizes BaseDoc and DocList for input and output management.
- Serving Layer: Executors process documents while the Gateway handles service connections.
- Orchestration Layer: Deployments manage Executors, while Flows create and manage data pipelines.
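For example, the data layer's BaseDoc and DocList (used throughout the examples below) might be used like this; a minimal sketch with an illustrative schema:

from docarray import BaseDoc, DocList


class Greeting(BaseDoc):
    text: str


# DocList is a typed container of documents; it also exposes column-wise
# access such as docs.text, which the Executor examples below rely on.
docs = DocList[Greeting]([Greeting(text='hello'), Greeting(text='world')])
print(docs.text)  # ['hello', 'world']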
Creating AI Services
Here's how you can set up a gRPC-based AI service with StableLM:
from jina import Executor, requests
from docarray import DocList, BaseDoc
from transformers import pipeline


class Prompt(BaseDoc):
    text: str


class Generation(BaseDoc):
    prompt: str
    text: str


class StableLM(Executor):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.generator = pipeline(
            'text-generation', model='stabilityai/stablelm-base-alpha-3b'
        )

    @requests
    def generate(self, docs: DocList[Prompt], **kwargs) -> DocList[Generation]:
        generations = DocList[Generation]()
        prompts = docs.text
        llm_outputs = self.generator(prompts)
        for prompt, output in zip(prompts, llm_outputs):
            # the pipeline returns a list of candidates per prompt; keep the first one's text
            generations.append(Generation(prompt=prompt, text=output[0]['generated_text']))
        return generations
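Because an Executor is an ordinary Python class, you can sanity-check it locally before serving it. A minimal sketch (note that this loads and runs the model in-process):

from docarray import DocList
from executor import StableLM, Prompt

executor = StableLM()
result = executor.generate(docs=DocList[Prompt]([Prompt(text='Once upon a time')]))
print(result[0].text)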
Deployment
Deploy your service from Python:

from jina import Deployment
from executor import StableLM

dep = Deployment(uses=StableLM, timeout_ready=-1, port=12345)

with dep:
    dep.block()

Or with an equivalent YAML configuration (e.g. deployment.yml):

jtype: Deployment
with:
  uses: StableLM
  py_modules:
    - executor.py
  timeout_ready: -1
  port: 12345
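Assuming the YAML above is saved as deployment.yml, it can then be started from the Jina CLI:

jina deployment --uses deployment.yml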
Client Usage
Connect and interact with your service like this:
from jina import Client
from docarray import DocList
from executor import Prompt, Generation
prompt = Prompt(text='suggest an interesting image generation prompt')
client = Client(port=12345)
response = client.post('/', inputs=[prompt], return_type=DocList[Generation])
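The response is a DocList[Generation], so the generated text can be read directly:

print(response[0].text)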
Building Pipelines
Chain multiple services into a complete data processing Flow:
from jina import Flow
from executor import StableLM
from text_to_image import TextToImage  # assumed second Executor, e.g. defined in text_to_image.py

flow = Flow(port=12345).add(uses=StableLM).add(uses=TextToImage)

with flow:
    flow.block()
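The Kubernetes and Docker Compose commands below export a flow.yml; a sketch of what that configuration might look like for this two-step pipeline, assuming the Executors live in executor.py and text_to_image.py:

jtype: Flow
with:
  port: 12345
executors:
  - uses: StableLM
    timeout_ready: -1
    py_modules:
      - executor.py
  - uses: TextToImage
    timeout_ready: -1
    py_modules:
      - text_to_image.py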
Scaling and Deployment
Local Scaling
Enhance service throughput using features such as:
- Replicas: Run multiple copies of an Executor for parallel processing and higher availability.
- Shards: Partition data across Executors.
- Dynamic Batching: Group incoming requests into batches for more efficient model inference.
Example for scaling a deployment:
jtype: Deployment
with:
  uses: TextToImage
  timeout_ready: -1
  py_modules:
    - text_to_image.py
  env:
    CUDA_VISIBLE_DEVICES: RR
  replicas: 2
  uses_dynamic_batching:
    /default:
      preferred_batch_size: 10
      timeout: 200
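The same options can also be set from Python; a sketch using Deployment constructor arguments that mirror the YAML fields above:

from jina import Deployment
from text_to_image import TextToImage

dep = Deployment(
    uses=TextToImage,
    timeout_ready=-1,
    env={'CUDA_VISIBLE_DEVICES': 'RR'},
    replicas=2,
    uses_dynamic_batching={'/default': {'preferred_batch_size': 10, 'timeout': 200}},
)

with dep:
    dep.block()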
Cloud Deployment
Containerize Your Services: Organize your Executor code and configuration files, then push them to Executor Hub so they can be referenced from any Flow.
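In practice, that means giving each Executor its own directory with its configuration and requirements, then pushing it with the Jina CLI (a sketch; run from the Executor's directory):

jina hub push .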
Kubernetes Deployment:
jina export kubernetes flow.yml ./my-k8s
kubectl apply -R -f my-k8s
Docker Compose:
jina export docker-compose flow.yml docker-compose.yml
docker-compose up
JCloud Deployment: Deploy with just one command:
jina cloud deploy jcloud-flow.yml
LLM Streaming
Stream data for responsive applications:
- Define Data Schemas:
from docarray import BaseDoc


class PromptDocument(BaseDoc):
    prompt: str
    max_tokens: int


class ModelOutputDocument(BaseDoc):
    token_id: int
    generated_text: str
- Implement Streaming (a fuller sketch of this logic appears at the end of this section):

from jina import Executor, requests


class TokenStreamingExecutor(Executor):
    @requests(on='/stream')
    async def task(self, doc: PromptDocument, **kwargs) -> ModelOutputDocument:
        ...  # streaming logic: yield one ModelOutputDocument per generated token
- Deploy the Service:
# Server-side deployment
from jina import Deployment

with Deployment(uses=TokenStreamingExecutor, port=12345, protocol='grpc') as dep:
    dep.block()

# Client-side interaction
import asyncio
from jina import Client


async def main():
    client = Client(port=12345, protocol='grpc', asyncio=True)
    async for doc in client.stream_doc(
        on='/stream',
        inputs=PromptDocument(prompt='what is the capital of France ?', max_tokens=10),
        return_type=ModelOutputDocument,
    ):
        print(doc.generated_text)


asyncio.run(main())
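For completeness, here is one way the streaming logic from the second step could look. This is a minimal sketch: the token loop is a stand-in rather than a real language model call, and only illustrates the async-generator pattern that streaming endpoints use.

from jina import Executor, requests

# PromptDocument and ModelOutputDocument are the schemas defined above.


class TokenStreamingExecutor(Executor):
    @requests(on='/stream')
    async def task(self, doc: PromptDocument, **kwargs) -> ModelOutputDocument:
        # Yield one document per 'token'. A real implementation would call a
        # language model here and stream tokens as they are produced.
        for token_id in range(doc.max_tokens):
            yield ModelOutputDocument(
                token_id=token_id,
                generated_text=f'{doc.prompt} [token {token_id}]',
            )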
Conclusion
Jina-Serve, supported by Jina AI, is a powerful tool for developers looking to build, scale, and deploy AI services with ease. Licensed under Apache-2.0, it's ready for a wide range of applications.