Inference solutions for BLOOM models
This repository provides fast inference solutions for the BLOOM language model, targeting researchers and developers needing to deploy large models efficiently. It offers integration with Hugging Face Accelerate and DeepSpeed Inference for optimized generation, aiming to simplify and speed up the deployment process.
How It Works
The project leverages DeepSpeed Inference and Hugging Face Accelerate for efficient BLOOM inference. DeepSpeed Inference uses ZeroQuant for post-training INT8 quantization, while Hugging Face Accelerate uses LLM.int8(). These methods speed up generation and reduce memory footprint, particularly for the 176B-parameter BLOOM model, and the scripts support FP16, BF16, and INT8 data types.
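As a rough illustration of the Accelerate path, the sketch below loads a BLOOM checkpoint with device_map="auto" and LLM.int8() quantization through bitsandbytes. It is not one of the repository's own scripts; the model name and generation settings are placeholders, and it assumes transformers, accelerate, and bitsandbytes are installed.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom"  # 176B checkpoint; try a smaller BLOOM variant first

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",   # Accelerate shards weights across available GPUs and CPU
    load_in_8bit=True,   # LLM.int8() quantization; omit for FP16/BF16
    torch_dtype=torch.float16,
)

inputs = tokenizer("DeepSpeed is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))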
Quick Start & Requirements
pip install flask flask_api gunicorn pydantic accelerate "huggingface_hub>=0.9.0" "deepspeed>=0.7.3" deepspeed-mii==0.0.2
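For the DeepSpeed-Inference path, a minimal sketch (again not the repository's own script) wraps the model with deepspeed.init_inference to enable tensor parallelism and fused kernels; launch it with the deepspeed launcher, for example deepspeed --num_gpus 8 script.py. The model name and decoding settings are placeholders.

import os
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom"  # placeholder; any BLOOM checkpoint follows the same pattern

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

model = deepspeed.init_inference(
    model,
    mp_size=int(os.getenv("WORLD_SIZE", "1")),  # tensor-parallel degree set by the launcher
    dtype=torch.float16,                        # INT8 is used for the ZeroQuant path
    replace_with_kernel_inject=True,            # swap in DeepSpeed's fused inference kernels
)
model = model.module  # unwrap the engine so the usual generate() API is available

inputs = tokenizer("DeepSpeed is", return_tensors="pt").to(torch.cuda.current_device())
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))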
Maintenance & Community
This repository has been archived and is no longer maintained; newer solutions such as vLLM and Text Generation Inference (TGI) are recommended instead. Issues should be opened in the respective backend repositories (Accelerate, DeepSpeed-Inference, DeepSpeed-ZeRO).
Licensing & Compatibility
The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The project is archived and no longer maintained. The provided scripts were tested only with BLOOM 176B on specific GPU configurations and may not work with other models or hardware setups. GPU memory may not be freed if a run crashes.