transformers-bloom-inference by huggingface

Inference solutions for BLOOM models

created 2 years ago
563 stars

Top 58.0% on sourcepulse

Project Summary

This repository provides fast inference solutions for BLOOM models, aimed at researchers and developers who need to deploy large language models efficiently. It integrates Hugging Face Accelerate and DeepSpeed Inference for optimized generation, simplifying and speeding up deployment.

How It Works

The project leverages DeepSpeed Inference and Hugging Face Accelerate for efficient BLOOM model inference. DeepSpeed Inference utilizes ZeroQuant for post-training quantization (INT8), while Hugging Face Accelerate uses LLM.int8() for quantization. These methods enable faster inference and reduced memory footprint, particularly for the large BLOOM 176B model, supporting FP16, BF16, and INT8 data types.
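To make the quantization idea concrete, here is a toy sketch (not the repository's code) of per-row absmax INT8 quantization, the basic mechanism underlying both ZeroQuant and LLM.int8(): each row of weights is scaled so its largest magnitude maps to 127, stored as int8 codes plus one float scale, and dequantized by multiplying back. The real methods add refinements (e.g. LLM.int8() handles outlier features in FP16), which this sketch omits.

```python
def quantize_row(row):
    """Quantize a list of floats to int8 codes with a per-row absmax scale."""
    # Largest magnitude maps to 127; guard against an all-zero row.
    scale = max(abs(x) for x in row) / 127 or 1.0
    q = [round(x / scale) for x in row]
    return q, scale

def dequantize_row(q, scale):
    """Recover approximate float values from the int8 codes."""
    return [v * scale for v in q]

weights = [0.5, -1.2, 0.03, 1.2]
q, scale = quantize_row(weights)
approx = dequantize_row(q, scale)

# Worst-case rounding error is at most half the scale step.
max_err = max(abs(a - b) for a, b in zip(weights, approx))
```

Storing one byte per weight instead of two (FP16) roughly halves memory for the quantized layers, which is why BLOOM 176B fits on 4 instead of 8 A100 80GB GPUs in INT8.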

Quick Start & Requirements

  • Install: pip install flask flask_api gunicorn pydantic accelerate huggingface_hub>=0.9.0 deepspeed>=0.7.3 deepspeed-mii==0.0.2
  • Prerequisites: Tested on 8x A100 80GB GPUs (FP16/BF16) or 4x A100 80GB GPUs (INT8). May require specific CUDA versions and DeepSpeed installation from source for advanced features.
  • Resources: Significant GPU memory and compute are required for BLOOM 176B.
  • Docs: Links to backend-specific issues for Accelerate, DeepSpeed-Inference, and DeepSpeed-ZeRO are provided.
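For reference, the archived repository launches its inference scripts roughly as follows. Script names and flags below are taken from the upstream project's README; since the repo is archived, verify them against the source before use:

```shell
# DeepSpeed Inference backend: one process per GPU via the deepspeed launcher
deepspeed --num_gpus 8 bloom-inference-scripts/bloom-ds-inference.py \
    --name bigscience/bloom --benchmark

# HF Accelerate backend: single process, weights dispatched across GPUs
python bloom-inference-scripts/bloom-accelerate-inference.py \
    --name bigscience/bloom --batch_size 1 --benchmark
```

Both commands assume the dependencies from the install line above and enough GPU memory for the chosen precision.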

Highlighted Details

  • Supports both Hugging Face Accelerate and DeepSpeed Inference backends.
  • Offers command-line interfaces for direct inference and a server deployment option.
  • Includes a benchmark system to evaluate inference performance.
  • Provides options for INT8 quantization using ZeroQuant (DeepSpeed) or LLM.int8() (HF Accelerate).

Maintenance & Community

This repository has been archived and is no longer maintained, with newer solutions like vLLM and TGI recommended. Issues should be opened in the respective backend repositories (Accelerate, DeepSpeed-Inference, DeepSpeed-ZeRO).

Licensing & Compatibility

The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project is archived and no longer maintained. The provided scripts are specifically tested for BLOOM 176B on particular GPU configurations and may not work with other models or hardware setups. GPU memory may not be freed upon crashes.

Health Check

  • Last commit: 9 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 3 stars in the last 90 days
