sagemaker-huggingface-inference-toolkit by aws

Serving Hugging Face models on Amazon SageMaker

created 4 years ago
264 stars

Top 97.5% on sourcepulse

Project Summary

This library provides an open-source toolkit for deploying Hugging Face Transformers and Diffusers models on Amazon SageMaker, simplifying the inference process for developers and researchers. It offers default pre-processing, prediction, and post-processing for common Hugging Face models and tasks, leveraging the SageMaker Inference Toolkit for efficient model serving.

How It Works

The toolkit builds on the SageMaker Inference Toolkit to manage model-server startup and inference requests. It uses environment variables such as HF_TASK and HF_MODEL_ID to automatically configure and load models from the Hugging Face Hub. Users can also supply custom inference logic by including a code/inference.py script in their model artifacts that overrides the default handler methods (model_fn, input_fn, predict_fn, output_fn, or transform_fn).
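The override mechanism can be sketched as a minimal code/inference.py. The sentiment-analysis task and the JSON handling below are illustrative assumptions, not the toolkit's exact defaults:

```python
# code/inference.py -- a minimal sketch of custom handlers.
# Only the functions you define are overridden; the toolkit keeps its
# defaults for any handler you leave out.
import json

def model_fn(model_dir):
    # Called once at startup with the unpacked model.tar.gz directory.
    # Loading a transformers pipeline here is an illustrative choice.
    from transformers import pipeline
    return pipeline("sentiment-analysis", model=model_dir)

def predict_fn(data, model):
    # `data` is the deserialized request body; the default JSON input_fn
    # yields a dict with an "inputs" key.
    return model(data["inputs"])

def output_fn(prediction, accept):
    # Serialize the prediction for the response body.
    return json.dumps(prediction)
```

Packaging this script under code/ inside the model archive is enough; no custom container image is required.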

Quick Start & Requirements

  • Install: pip install sagemaker --upgrade
  • Deploy from S3:
    from sagemaker import get_execution_role
    from sagemaker.huggingface import HuggingFaceModel

    role = get_execution_role()  # or pass an explicit IAM role ARN
    huggingface_model = HuggingFaceModel(
        transformers_version='4.6', pytorch_version='1.7', py_version='py36',
        model_data='s3://my-trained-model/artifacts/model.tar.gz', role=role,
    )
    huggingface_model.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")
    
  • Deploy from Hugging Face Hub (experimental):
    from sagemaker import get_execution_role
    from sagemaker.huggingface import HuggingFaceModel

    role = get_execution_role()  # or pass an explicit IAM role ARN
    hub = {'HF_MODEL_ID':'distilbert-base-uncased-distilled-squad', 'HF_TASK':'question-answering'}
    huggingface_model = HuggingFaceModel(
        transformers_version='4.6', pytorch_version='1.7', py_version='py36',
        env=hub, role=role,
    )
    huggingface_model.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")
    
  • Documentation: SageMaker Notebook Examples
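Either deployment can then be invoked through the predictor returned by deploy(). The payload below follows the Hugging Face question-answering request format; the question and context strings are illustrative:

```python
# Request payload for a question-answering endpoint; `predictor` is the
# object returned by huggingface_model.deploy(...) above.
payload = {
    "inputs": {
        "question": "Which service hosts the model?",
        "context": "The model was deployed to an Amazon SageMaker endpoint.",
    }
}
# predictor.predict(payload)  # returns an answer span with a score
```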

Highlighted Details

  • Supports deployment of Hugging Face models on AWS Inferentia2, with options for pre-compiled models or on-the-fly compilation using HF_OPTIMUM_BATCH_SIZE and HF_OPTIMUM_SEQUENCE_LENGTH.
  • Allows customization of inference logic through user-defined scripts (code/inference.py) that can override model_fn, transform_fn, input_fn, predict_fn, and output_fn.
  • Environment variables simplify configuration, including HF_MODEL_REVISION for pinning model versions and HF_API_TOKEN for private models.
  • Provides local testing capabilities by running the inference server directly via Python.
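The environment variables from the bullets above can be combined into a single env dict passed to HuggingFaceModel. The revision, token lookup, and Optimum values here are placeholders, not recommended settings:

```python
import os

# Consolidated `env` dict for HuggingFaceModel(env=hub, ...); values are
# illustrative placeholders.
hub = {
    "HF_MODEL_ID": "distilbert-base-uncased-distilled-squad",
    "HF_TASK": "question-answering",
    "HF_MODEL_REVISION": "main",  # pin a branch, tag, or commit on the Hub
    "HF_API_TOKEN": os.environ.get("HF_API_TOKEN", ""),  # needed for private models
    # For on-the-fly compilation on AWS Inferentia2:
    "HF_OPTIMUM_BATCH_SIZE": "1",
    "HF_OPTIMUM_SEQUENCE_LENGTH": "128",
}
```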

Maintenance & Community

This project is part of the AWS Deep Learning Containers ecosystem. Contribution guidelines are available in CONTRIBUTING.md.

Licensing & Compatibility

Licensed under the Apache 2.0 License. Compatible with commercial use.

Limitations & Caveats

The Hugging Face Hub deployment is noted as experimental and may not support all SageMaker features, such as Multi-Model Endpoints (MME).

Health Check

  • Last commit: 3 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 9 stars in the last 90 days

