Introduction to TensorFlow Serving
TensorFlow Serving is a flexible, high-performance system for serving machine learning models in production environments. It handles the inference phase of machine learning: it takes models after training, manages their lifetimes, and gives clients versioned, high-performance access to them via a lookup table. TensorFlow Serving provides out-of-the-box integration with TensorFlow models, but can be extended to serve other types of models and data.
Key Features
- Multi-Model Support: Capable of managing multiple models or multiple versions of a single model at the same time.
- Endpoint Compatibility: Exposes both gRPC and HTTP/REST endpoints for model inference (a gRPC client sketch follows this list).
- Seamless Updates: New model versions can be deployed without necessitating changes in the client code.
- Testing Capabilities: Supports canary deployments and A/B testing to trial new and experimental models.
- Low Latency: Adds minimal delay due to its efficient low-overhead design.
- Batch Scheduling: Groups inference requests into batches for execution on GPUs, with configurable latency settings.
- Diverse Model Support: Compatible with TensorFlow models, embeddings, vocabularies, feature transformations, and even non-TensorFlow models.
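For the gRPC endpoint mentioned above, requests are built from the PredictRequest protocol buffer and sent through a PredictionService stub. The sketch below is a minimal, illustrative client: it assumes the tensorflow-serving-api and grpcio packages, a server with its gRPC port published (8500 by default in the Docker image), and a hypothetical model named my_model whose serving signature takes a float tensor named "inputs"; adjust the names to match your model's signature.
```python
# Minimal gRPC client sketch for TensorFlow Serving's PredictionService.
# Assumes the `tensorflow-serving-api`, `grpcio`, and `tensorflow` packages,
# a server with its gRPC port (8500) published, and a hypothetical model
# named "my_model" with a float input tensor named "inputs".
import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

channel = grpc.insecure_channel("localhost:8500")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

request = predict_pb2.PredictRequest()
request.model_spec.name = "my_model"                   # hypothetical model name
request.model_spec.signature_name = "serving_default"  # default SavedModel signature
request.inputs["inputs"].CopyFrom(                     # tensor name from the model's signature
    tf.make_tensor_proto([1.0, 2.0, 5.0], dtype=tf.float32)
)

response = stub.Predict(request, 10.0)  # 10-second deadline
print(response.outputs)                 # map of output tensor name -> TensorProto
```
The HTTP/REST endpoint (port 8501) is shown in the quick start below.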
Quick Start: Serving a TensorFlow Model in 60 Seconds
To quickly deploy a TensorFlow model with TensorFlow Serving, follow a few simple steps:
- Download the TensorFlow Serving Docker image and repository:
```bash
docker pull tensorflow/serving

git clone https://github.com/tensorflow/serving

# Location of the demo models shipped with the repository (used in the next step)
TESTDATA="$(pwd)/serving/tensorflow_serving/servables/tensorflow/testdata"
```
- Start a TensorFlow Serving container and open the REST API port:
```bash
docker run -t --rm -p 8501:8501 \
    -v "$TESTDATA/saved_model_half_plus_two_cpu:/models/half_plus_two" \
    -e MODEL_NAME=half_plus_two \
    tensorflow/serving &
```
- Use the predict API to query the model:
```bash
curl -d '{"instances": [1.0, 2.0, 5.0]}' \
    -X POST http://localhost:8501/v1/models/half_plus_two:predict

# Returns => { "predictions": [2.5, 3.0, 4.5] }
```
This command returns the model's predictions for the supplied inputs; the demo model computes y = 0.5 * x + 2.
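The same REST call can be issued from Python. The snippet below is a minimal equivalent of the curl command above, using the third-party requests library (an assumption for illustration, not part of TensorFlow Serving itself).
```python
# Minimal sketch of the same REST predict call using the `requests` library.
import requests

resp = requests.post(
    "http://localhost:8501/v1/models/half_plus_two:predict",
    json={"instances": [1.0, 2.0, 5.0]},
)
resp.raise_for_status()
print(resp.json()["predictions"])  # => [2.5, 3.0, 4.5]
```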
Comprehensive Tutorials and Documentation
For a more detailed guide on training and serving a TensorFlow model, visit the official TensorFlow documentation.
Setting Up
The simplest method for using TensorFlow Serving is through Docker:
- Install TensorFlow Serving using Docker (Recommended)
- Install TensorFlow Serving without Docker (Not Recommended)
- Build TensorFlow Serving from Source
- Deploy on Kubernetes
Usage
To serve a TensorFlow model, first export it as a SavedModel: a language-neutral, recoverable serialization format that lets higher-level systems and tools produce, consume, and transform TensorFlow models. For more details on exporting models, refer to the TensorFlow guide.
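As a concrete illustration, the sketch below exports a trivial module, modeled on the half_plus_two demo, as a SavedModel under a numbered version directory, which is the layout TensorFlow Serving loads from. The class, output path, and tensor names are illustrative choices, not part of the official guide.
```python
# Minimal sketch: export a trivial model as a SavedModel for TensorFlow Serving.
# The module, export path, and tensor names here are illustrative choices.
import tensorflow as tf

class HalfPlusTwo(tf.Module):
    """Computes y = 0.5 * x + 2, mirroring the quick-start demo model."""

    @tf.function(input_signature=[tf.TensorSpec(shape=[None], dtype=tf.float32)])
    def __call__(self, x):
        return {"y": 0.5 * x + 2.0}

model = HalfPlusTwo()

# TensorFlow Serving loads models from <base_path>/<version>/, so export into
# a numeric version subdirectory (version 1 here).
tf.saved_model.save(
    model,
    "/tmp/half_plus_two/1",
    signatures={"serving_default": model.__call__.get_concrete_function()},
)
```
The exported base directory (/tmp/half_plus_two in this sketch) can then be mounted into the serving container, e.g. with -v /tmp/half_plus_two:/models/half_plus_two, just as in the quick start.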
Extending Functionality
TensorFlow Serving is highly modular, allowing users to extend its capabilities for new use cases:
- Acquaint yourself with building TensorFlow Serving.
- Understand its architecture.
- Explore the C++ API reference.
- Develop new types of servable models or custom sources for model versions.
Contribution and Further Information
Those interested in contributing to TensorFlow Serving should review the contribution guidelines.
For additional information, please visit the official TensorFlow website.