serving by tensorflow

Serving system for machine learning models in production

created 9 years ago · 6,310 stars

Top 8.3% on sourcepulse

Project Summary

TensorFlow Serving is a high-performance system for deploying machine learning models in production, primarily targeting ML engineers and researchers. It addresses the challenge of efficiently serving inference requests by managing model lifecycles and providing versioned access, enabling seamless updates and A/B testing without client code changes.

How It Works

TensorFlow Serving is built for flexibility and performance, supporting multiple models, and multiple versions of the same model, simultaneously. It exposes both gRPC and HTTP endpoints for inference. A key feature is its batching scheduler, which groups incoming inference requests for joint execution, particularly on GPUs, with configurable latency controls. Batching trades a small, bounded queuing delay per request for substantially better hardware utilization and overall throughput.
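Batching is controlled through server flags and a parameters file. The sketch below assumes the batching parameters text-proto format described in the serving batching guide; the values are illustrative placeholders, not tuned recommendations, and the model path is hypothetical.

    # Write an illustrative batching config (field names per the batching guide;
    # values here are assumptions, not tuned defaults)
    cat <<'EOF' > /tmp/batching.config
    max_batch_size { value: 32 }          # largest batch the scheduler may form
    batch_timeout_micros { value: 5000 }  # max time a request waits for a batch
    num_batch_threads { value: 4 }        # threads that execute formed batches
    max_enqueued_batches { value: 100 }   # backpressure limit on queued batches
    EOF

    # Start the server with batching enabled (model path is a placeholder)
    docker run -t --rm -p 8501:8501 \
        -v "$PWD/my_model:/models/my_model" \
        -v /tmp/batching.config:/config/batching.config \
        -e MODEL_NAME=my_model \
        tensorflow/serving \
        --enable_batching=true \
        --batching_parameters_file=/config/batching.config &

A larger batch_timeout_micros raises the latency ceiling in exchange for fuller batches; the right trade-off depends on the model and the traffic pattern.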

Quick Start & Requirements

  • Install/Run: Docker is the recommended installation method; a sample request follows this list.
    # Pull the serving image and fetch the repo's demo models
    docker pull tensorflow/serving
    git clone https://github.com/tensorflow/serving
    TESTDATA="$(pwd)/serving/tensorflow_serving/servables/tensorflow/testdata"
    # Start the container and open the REST API port
    docker run -t --rm -p 8501:8501 \
        -v "$TESTDATA/saved_model_half_plus_two_cpu:/models/half_plus_two" \
        -e MODEL_NAME=half_plus_two \
        tensorflow/serving &
    
  • Prerequisites: Docker, SavedModel format for TensorFlow models.
  • Resources: Docker setup takes minutes; serving itself needs CPU/GPU capacity commensurate with model size and request volume.
  • Docs: Official TensorFlow Serving Documentation (https://www.tensorflow.org/tfx/guide/serving)
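To verify the server started above, query the REST predict endpoint. The request and expected response follow the half_plus_two example from the official README.

    # Query the model's predict API on the REST port (8501)
    curl -d '{"instances": [1.0, 2.0, 5.0]}' \
        -X POST http://localhost:8501/v1/models/half_plus_two:predict
    # Returns => { "predictions": [2.5, 3.0, 4.5] }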

Highlighted Details

  • Serves multiple models or versions concurrently.
  • Supports both gRPC and RESTful HTTP inference APIs.
  • Enables canarying new versions and A/B testing (a config sketch follows this list).
  • Features a request batching scheduler for optimized GPU utilization.
  • Extensible to serve non-TensorFlow models and custom servables.
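As referenced above, here is a minimal sketch of how concurrent versions and canarying are typically wired up, assuming a models.config file in the ModelServerConfig text-proto format; the model name, paths, and version numbers are hypothetical placeholders.

    # Hypothetical models.config: pin two versions and label them for canarying
    cat <<'EOF' > /tmp/models.config
    model_config_list {
      config {
        name: "my_model"               # placeholder model name
        base_path: "/models/my_model"  # contains numeric version subdirectories
        model_platform: "tensorflow"
        model_version_policy { specific { versions: 1 versions: 2 } }
        version_labels { key: "stable" value: 1 }
        version_labels { key: "canary" value: 2 }
      }
    }
    EOF

    # Serve from the config file; gRPC on 8500, REST on 8501.
    # The last flag permits assigning labels at startup, per the docs.
    docker run -t --rm -p 8500:8500 -p 8501:8501 \
        -v "$PWD/my_model:/models/my_model" \
        -v /tmp/models.config:/config/models.config \
        tensorflow/serving \
        --model_config_file=/config/models.config \
        --allow_version_labels_for_unavailable_models=true &

Clients can then address a label such as "canary" instead of a hard-coded version number, so shifting traffic becomes a config change rather than a client change.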

Maintenance & Community

Maintained by the TensorFlow team at Google. Contribution guidelines are available.

Licensing & Compatibility

Apache 2.0 License. Compatible with commercial use and closed-source applications.

Limitations & Caveats

While the architecture is extensible, serving non-TensorFlow models requires implementing custom servables. Performance tuning may require understanding SavedModel warmup and signature definitions.
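Before tuning, it can help to inspect which signatures a SavedModel actually exports; the saved_model_cli tool that ships with TensorFlow dumps them (the path below is a placeholder).

    # List signature defs, input/output tensors, dtypes, and shapes
    saved_model_cli show --dir /models/my_model/1 --all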

Health Check

  • Last commit: 1 week ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 2

Star History

  • 48 stars in the last 90 days

Explore Similar Projects

Starred by Jeff Hammerbacher (Cofounder of Cloudera), Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake), and 2 more.

gpustack by gpustack

GPU cluster manager for AI model deployment
Top 1.6% on sourcepulse · 3k stars · created 1 year ago · updated 2 days ago
Starred by Carol Willing (Core Contributor to CPython, Jupyter), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 4 more.

dynamo by ai-dynamo

Inference framework for distributed generative AI model serving
Top 1.1% on sourcepulse · 5k stars · created 5 months ago · updated 9 hours ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 5 more.

TensorRT-LLM by NVIDIA

LLM inference optimization SDK for NVIDIA GPUs
Top 0.6% on sourcepulse · 11k stars · created 1 year ago · updated 11 hours ago