djl-serving by deepjavalibrary

Scalable ML model serving via HTTP endpoints

Created 4 years ago

253 stars

Top 99.4% on SourcePulse

1 Expert Loves This Project

David Phillips

Author of Trino, Presto

Project Summary

DJL Serving is a high-performance, universal model deployment solution designed to make machine learning models accessible via a scalable HTTP endpoint. It targets engineers and researchers seeking an efficient way to serve diverse model types, offering benefits such as simplified deployment, automatic scaling, and high throughput within a single Java Virtual Machine (JVM).

How It Works

DJL Serving operates by running inference across multiple threads within a single JVM, aiming for superior throughput compared to many C++-based model servers. It natively supports popular formats like PyTorch TorchScript, TensorFlow SavedModel, ONNX, and Python scripts, with extensibility for others like XGBoost and LightGBM via plugins. The system automatically scales worker threads based on load, supports dynamic batching to enhance throughput, and allows serving multiple model versions or models from different engines concurrently on a single endpoint.
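To make the dynamic batching idea concrete, here is a minimal Python sketch that groups queued requests into batches bounded by a size limit and a maximum wait. It is not DJL Serving's implementation: the names collect_batch, run_batches, max_batch_size, and max_delay_s are hypothetical and chosen only for illustration.

```python
import queue
import time


def collect_batch(requests, max_batch_size=4, max_delay_s=0.05):
    """Pull up to max_batch_size items from a queue, waiting at most max_delay_s.

    Hypothetical helper for illustration; not DJL Serving's actual code.
    """
    batch = []
    deadline = time.monotonic() + max_delay_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # the wait budget is spent; ship what we have
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break  # nothing more arrived before the deadline
    return batch


def run_batches(requests, predict_batch, max_batch_size=4, max_delay_s=0.05):
    """Drain the queue, calling predict_batch once per collected batch."""
    results = []
    while True:
        batch = collect_batch(requests, max_batch_size, max_delay_s)
        if not batch:
            break
        results.extend(predict_batch(batch))
    return results


# Usage: six queued inputs are served as one batch of four and one of two.
pending = queue.Queue()
for value in range(6):
    pending.put(value)

batch_sizes = []


def double_all(batch):
    # Stand-in for a model's batched forward pass.
    batch_sizes.append(len(batch))
    return [2 * x for x in batch]


outputs = run_batches(pending, double_all)
```

The trade-off this illustrates is general: a larger batch size or longer delay tends to raise throughput per model call, at the cost of added latency for the first request in each batch.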

Quick Start & Requirements

macOS: Install via Homebrew: brew install djl-serving. Start/stop services with brew services start djl-serving / brew services stop djl-serving.
Ubuntu: Download .deb package: curl -O https://publish.djl.ai/djl-serving/djl-serving_0.30.0-1_all.deb, then install: sudo dpkg -i djl-serving_0.30.0-1_all.deb.
Windows: Download zip from https://publish.djl.ai/djl-serving/serving-0.30.0.zip, unzip, and run serving-0.30.0\bin\serving.bat. A Chocolatey package is under consideration.
Docker: Run using docker run -itd -p 8080:8080 deepjavalibrary/djl-serving.
Prerequisites: Homebrew on macOS, dpkg on Ubuntu, or Docker for the container image; the Windows install needs only the zip download.
Documentation: Command-line help available via djl-serving --help. Links for configuration, architecture, and plugin management are referenced but not provided in the README.
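Once the server is running, a quick reachability check from Python might look like the sketch below. It assumes the server listens on localhost port 8080 (as in the Docker command above) and that a GET /ping health path exists; the /ping path is an assumption not stated in this summary and should be verified against the DJL Serving documentation. The helper names endpoint and is_up are hypothetical.

```python
import urllib.error
import urllib.request

# Assumption: the server started above is reachable on localhost:8080.
BASE_URL = "http://127.0.0.1:8080"


def endpoint(base_url, *parts):
    """Join a base URL and path segments with single slashes."""
    return "/".join([base_url.rstrip("/")] + [p.strip("/") for p in parts])


def is_up(base_url=BASE_URL, timeout_s=2.0):
    """Return True if GET <base_url>/ping answers with HTTP 200.

    The /ping path is an assumption; check it against the DJL Serving docs.
    """
    try:
        with urllib.request.urlopen(endpoint(base_url, "ping"), timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False
```

Calling is_up() after starting the container gives a simple pass/fail signal before sending real inference requests.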

Highlighted Details

Performance: Claims higher throughput than most C++ model servers based on internal benchmarks.
Ease of Use: Capable of serving most model types out-of-the-box.
Extensibility: Supports custom extensions through a plugin system.
Auto-scaling: Dynamically adjusts worker threads based on inference load.
Dynamic Batching: Aggregates inference requests to improve throughput.
Model Versioning: Allows loading and managing different model versions on the same endpoint.
Multi-Engine Support: Facilitates serving models from various deep learning frameworks simultaneously.
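The auto-scaling behavior listed above can be pictured with a small sketch that maps queue depth to a worker count. The policy, the function name target_workers, and every threshold here are hypothetical; the summary does not describe how DJL Serving actually decides when to add or remove worker threads.

```python
import math


def target_workers(queued_requests, min_workers=1, max_workers=8,
                   requests_per_worker=4):
    """Pick a worker count that keeps about requests_per_worker queued per worker.

    All thresholds are hypothetical and chosen only for illustration.
    """
    needed = math.ceil(queued_requests / requests_per_worker)
    # Clamp to the configured bounds so the pool never empties or runs away.
    return max(min_workers, min(max_workers, needed))
```

For example, with these defaults an idle queue keeps one worker alive, ten queued requests ask for three workers, and a burst of a hundred is capped at eight.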

Maintenance & Community

No specific details regarding maintainers, community channels (like Discord/Slack), sponsorships, or roadmaps are present in the provided README.

Licensing & Compatibility

The license type and any compatibility notes for commercial or closed-source use are not specified in the provided README.

Limitations & Caveats

The Windows installation currently relies on manual zip file extraction; a Chocolatey package is under consideration but not yet available. By default, DJL Serving listens on port 8080 and accepts connections only from localhost, so remote access requires a configuration change.

Health Check

Last Commit

4 days ago

Responsiveness

Inactive

Star History

0 stars in the last 30 days