Serving system for machine learning models in production
TensorFlow Serving is a high-performance system for deploying machine learning models in production, primarily targeting ML engineers and researchers. It addresses the challenge of efficiently serving inference requests by managing model lifecycles and providing versioned access, enabling seamless updates and A/B testing without client code changes.
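As a sketch of how versioned access and A/B-style rollouts can be configured (not part of this repo's quick start): a model_config_file can pin specific versions and attach labels so clients target "stable" or "canary" without code changes. The file name, model name, label names, and version numbers below are illustrative, and the example assumes the base path contains version subdirectories 1/ and 2/.

# Illustrative models.config (textproto); names and versions are placeholders
cat > models.config <<'EOF'
model_config_list {
  config {
    name: "mymodel"
    base_path: "/models/mymodel"
    model_platform: "tensorflow"
    model_version_policy { specific { versions: 1 versions: 2 } }
    version_labels { key: "stable" value: 1 }
    version_labels { key: "canary" value: 2 }
  }
}
EOF

# Point the model server at the config file instead of a single MODEL_NAME
docker run -t --rm -p 8500:8500 -p 8501:8501 \
    -v "$(pwd)/models.config:/models/models.config" \
    -v "$(pwd)/mymodel:/models/mymodel" \
    tensorflow/serving --model_config_file=/models/models.config &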
How It Works
TensorFlow Serving is built for flexibility and performance, serving multiple models, and multiple versions of the same model, simultaneously. It exposes both gRPC and REST (HTTP) endpoints for inference. A key feature is its request scheduler, which groups individual inference requests into batches for joint execution, which is especially effective on GPUs, with configurable limits on how long a request may wait for a batch to fill. Batching trades a bounded amount of added latency for higher throughput and better hardware utilization.
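A minimal sketch of how the batching scheduler might be tuned, assuming the standard --enable_batching and --batching_parameters_file model server flags; the parameter values below are placeholders to be tuned per model, not recommendations.

# Illustrative batching parameters (textproto)
cat > batching.config <<'EOF'
max_batch_size { value: 32 }          # largest batch executed in one model step
batch_timeout_micros { value: 1000 }  # max wait for a batch to fill (latency bound)
max_enqueued_batches { value: 100 }   # queue depth before requests are rejected
num_batch_threads { value: 4 }        # parallelism for executing batches
EOF

docker run -t --rm -p 8501:8501 \
    -v "$(pwd)/batching.config:/models/batching.config" \
    -v "$TESTDATA/saved_model_half_plus_two_cpu:/models/half_plus_two" \
    -e MODEL_NAME=half_plus_two \
    tensorflow/serving --enable_batching \
    --batching_parameters_file=/models/batching.config &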
Quick Start & Requirements
docker pull tensorflow/serving

# Clone the repo to get the demo model referenced by $TESTDATA below
git clone https://github.com/tensorflow/serving
TESTDATA="$(pwd)/serving/tensorflow_serving/servables/tensorflow/testdata"

# Start the serving container and expose the REST API port
docker run -t --rm -p 8501:8501 \
    -v "$TESTDATA/saved_model_half_plus_two_cpu:/models/half_plus_two" \
    -e MODEL_NAME=half_plus_two \
    tensorflow/serving &
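With the server running, the model can be queried over the REST API; the half_plus_two demo model returns x/2 + 2 for each input.

curl -d '{"instances": [1.0, 2.0, 5.0]}' \
    -X POST http://localhost:8501/v1/models/half_plus_two:predict
# => { "predictions": [2.5, 3.0, 4.5] }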
Highlighted Details
Maintenance & Community
Maintained by the TensorFlow team at Google. Contribution guidelines are available.
Licensing & Compatibility
Apache 2.0 License. Compatible with commercial use and closed-source applications.
Limitations & Caveats
While the architecture is extensible, serving non-TensorFlow models requires implementing custom servables. Performance tuning may also require familiarity with SavedModel warmup requests and signature definitions.