Introduction to Jina-Serve
Jina-Serve is a comprehensive framework designed for building and deploying AI services. It facilitates communication through gRPC, HTTP, and WebSockets, allowing developers to focus on their core logic while easily scaling services from local environments to full production deployments.
Key Features
- Wide Compatibility: Natively supports major machine learning frameworks and diverse data types.
- Optimized Performance: High-performance architecture featuring scaling, streaming, and dynamic batching for efficient service operations.
- LLM Serving: Supports streaming output for large language models.
- Integration with Containers: Comes with built-in Docker support and access to the Executor Hub.
- Easy Deployment: Offers one-click deployment to Jina AI Cloud.
- Enterprise-Ready: Compatible with Kubernetes and Docker Compose for seamless enterprise integration.
Advantages Over FastAPI
Compared with a general-purpose web framework such as FastAPI, Jina-Serve offers:
- DocArray-Based Data Handling: Ensures efficient data manipulation with native gRPC support.
- Integrated Containerization: Simplifies service orchestration, enabling easier scaling.
- Effortless Cloud Deployment: Deploy services to the cloud with a single command.
Installation
To get started with Jina-Serve, install it via pip:
pip install jina
For specific operating systems, check out the installation guides for Apple Silicon and Windows.
Core Concepts
Jina-Serve is built on three foundational layers:
- Data Layer: Utilizes BaseDoc and DocList for input and output management.
- Serving Layer: Executors process documents while the Gateway handles service connections.
- Orchestration Layer: Deployments manage Executors, while Flows create and manage data pipelines.
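For example, the data layer's BaseDoc and DocList (used throughout the examples below) might be used like this; a minimal sketch with an illustrative schema:

from docarray import BaseDoc, DocList


class Greeting(BaseDoc):
    text: str


# DocList is a typed container of documents; it also exposes column-wise
# access such as docs.text, which the Executor examples below rely on.
docs = DocList[Greeting]([Greeting(text='hello'), Greeting(text='world')])
print(docs.text)  # ['hello', 'world']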
Creating AI Services
Here's how you can set up a gRPC-based AI service with StableLM:
from jina import Executor, requests
from docarray import DocList, BaseDoc
from transformers import pipeline


class Prompt(BaseDoc):
    text: str


class Generation(BaseDoc):
    prompt: str
    text: str


class StableLM(Executor):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.generator = pipeline(
            'text-generation', model='stabilityai/stablelm-base-alpha-3b'
        )

    @requests
    def generate(self, docs: DocList[Prompt], **kwargs) -> DocList[Generation]:
        generations = DocList[Generation]()
        prompts = docs.text
        llm_outputs = self.generator(prompts)
        for prompt, output in zip(prompts, llm_outputs):
            # the pipeline returns a list of candidates per prompt; keep the first one's text
            generations.append(Generation(prompt=prompt, text=output[0]['generated_text']))
        return generations
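Because an Executor is an ordinary Python class, you can sanity-check it locally before serving it. A minimal sketch (note that this loads and runs the model in-process):

from docarray import DocList
from executor import StableLM, Prompt

executor = StableLM()
result = executor.generate(docs=DocList[Prompt]([Prompt(text='Once upon a time')]))
print(result[0].text)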
Deployment
Deploy your service from Python:

from jina import Deployment
from executor import StableLM

dep = Deployment(uses=StableLM, timeout_ready=-1, port=12345)

with dep:
    dep.block()

Or with an equivalent YAML configuration (e.g. deployment.yml):

jtype: Deployment
with:
  uses: StableLM
  py_modules:
    - executor.py
  timeout_ready: -1
  port: 12345
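Assuming the YAML above is saved as deployment.yml, it can then be started from the Jina CLI:

jina deployment --uses deployment.yml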
Client Usage
Connect and interact with your service like this:
from jina import Client
from docarray import DocList
from executor import Prompt, Generation
prompt = Prompt(text='suggest an interesting image generation prompt')
client = Client(port=12345)
response = client.post('/', inputs=[prompt], return_type=DocList[Generation])
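The response is a DocList[Generation], so the generated text can be read directly:

print(response[0].text)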
Building Pipelines
Chain multiple services into a complete data processing Flow:
from jina import Flow
from executor import StableLM
from text_to_image import TextToImage  # assumed second Executor, e.g. defined in text_to_image.py

flow = Flow(port=12345).add(uses=StableLM).add(uses=TextToImage)

with flow:
    flow.block()
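The Kubernetes and Docker Compose commands below export a flow.yml; a sketch of what that configuration might look like for this two-step pipeline, assuming the Executors live in executor.py and text_to_image.py:

jtype: Flow
with:
  port: 12345
executors:
  - uses: StableLM
    timeout_ready: -1
    py_modules:
      - executor.py
  - uses: TextToImage
    timeout_ready: -1
    py_modules:
      - text_to_image.py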
Scaling and Deployment
Local Scaling
Enhance service throughput using features such as:
- Replicas: Run multiple copies of an Executor for parallel processing and higher availability.
- Shards: Partition data across Executors.
- Dynamic Batching: Group incoming requests into batches for more efficient model inference.
Example for scaling a deployment:
jtype: Deployment
with:
  uses: TextToImage
  timeout_ready: -1
  py_modules:
    - text_to_image.py
  env:
    CUDA_VISIBLE_DEVICES: RR
  replicas: 2
  uses_dynamic_batching:
    /default:
      preferred_batch_size: 10
      timeout: 200
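The same options can also be set from Python; a sketch using Deployment constructor arguments that mirror the YAML fields above:

from jina import Deployment
from text_to_image import TextToImage

dep = Deployment(
    uses=TextToImage,
    timeout_ready=-1,
    env={'CUDA_VISIBLE_DEVICES': 'RR'},
    replicas=2,
    uses_dynamic_batching={'/default': {'preferred_batch_size': 10, 'timeout': 200}},
)

with dep:
    dep.block()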
Cloud Deployment
Containerize Your Services: Organize your Executor code and configuration files, then push them to Executor Hub so they can be referenced from any Flow.
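In practice, that means giving each Executor its own directory with its configuration and requirements, then pushing it with the Jina CLI (a sketch; run from the Executor's directory):

jina hub push .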
Kubernetes Deployment:
jina export kubernetes flow.yml ./my-k8s
kubectl apply -R -f my-k8s
Docker Compose:
jina export docker-compose flow.yml docker-compose.yml
docker-compose up
JCloud Deployment: Deploy with just one command:
jina cloud deploy jcloud-flow.yml
LLM Streaming
Stream data for responsive applications:
- Define Data Schemas:
from docarray import BaseDoc


class PromptDocument(BaseDoc):
    prompt: str
    max_tokens: int


class ModelOutputDocument(BaseDoc):
    token_id: int
    generated_text: str
- Implement Streaming (a fuller sketch of this logic appears at the end of this section):

from jina import Executor, requests


class TokenStreamingExecutor(Executor):
    @requests(on='/stream')
    async def task(self, doc: PromptDocument, **kwargs) -> ModelOutputDocument:
        ...  # streaming logic: yield one ModelOutputDocument per generated token
- Deploy the Service:
# Server-side deployment
from jina import Deployment

with Deployment(uses=TokenStreamingExecutor, port=12345, protocol='grpc') as dep:
    dep.block()

# Client-side interaction
import asyncio
from jina import Client


async def main():
    client = Client(port=12345, protocol='grpc', asyncio=True)
    async for doc in client.stream_doc(
        on='/stream',
        inputs=PromptDocument(prompt='what is the capital of France ?', max_tokens=10),
        return_type=ModelOutputDocument,
    ):
        print(doc.generated_text)


asyncio.run(main())
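For completeness, here is one way the streaming logic from the second step could look. This is a minimal sketch: the token loop is a stand-in rather than a real language model call, and only illustrates the async-generator pattern that streaming endpoints use.

from jina import Executor, requests

# PromptDocument and ModelOutputDocument are the schemas defined above.


class TokenStreamingExecutor(Executor):
    @requests(on='/stream')
    async def task(self, doc: PromptDocument, **kwargs) -> ModelOutputDocument:
        # Yield one document per 'token'. A real implementation would call a
        # language model here and stream tokens as they are produced.
        for token_id in range(doc.max_tokens):
            yield ModelOutputDocument(
                token_id=token_id,
                generated_text=f'{doc.prompt} [token {token_id}]',
            )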
Conclusion
Jina-Serve, supported by Jina AI, is a powerful tool for developers looking to build, scale, and deploy AI services with ease. Licensed under Apache-2.0, it's ready for a wide range of applications.