Inference solutions for BLOOM models
This repository provides fast inference solutions for the BLOOM language model, targeting researchers and developers needing to deploy large models efficiently. It offers integration with Hugging Face Accelerate and DeepSpeed Inference for optimized generation, aiming to simplify and speed up the deployment process.
How It Works
The project leverages DeepSpeed Inference and Hugging Face Accelerate for efficient BLOOM inference. DeepSpeed Inference uses ZeroQuant for post-training INT8 quantization, while Hugging Face Accelerate uses LLM.int8(). These methods speed up generation and reduce memory footprint, particularly for the 176B-parameter BLOOM model, and the scripts support FP16, BF16, and INT8 data types.
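As a rough illustration of the Accelerate path, the sketch below loads a BLOOM checkpoint with device_map="auto" and LLM.int8() quantization through bitsandbytes. It is not one of the repository's own scripts; the model name and generation settings are placeholders, and it assumes transformers, accelerate, and bitsandbytes are installed.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom"  # 176B checkpoint; try a smaller BLOOM variant first

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",   # Accelerate shards weights across available GPUs and CPU
    load_in_8bit=True,   # LLM.int8() quantization; omit for FP16/BF16
    torch_dtype=torch.float16,
)

inputs = tokenizer("DeepSpeed is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))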
Quick Start & Requirements
pip install flask flask_api gunicorn pydantic accelerate "huggingface_hub>=0.9.0" "deepspeed>=0.7.3" deepspeed-mii==0.0.2
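For the DeepSpeed-Inference path, a minimal sketch (again not the repository's own script) wraps the model with deepspeed.init_inference to enable tensor parallelism and fused kernels; launch it with the deepspeed launcher, for example deepspeed --num_gpus 8 script.py. The model name and decoding settings are placeholders.

import os
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom"  # placeholder; any BLOOM checkpoint follows the same pattern

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

model = deepspeed.init_inference(
    model,
    mp_size=int(os.getenv("WORLD_SIZE", "1")),  # tensor-parallel degree set by the launcher
    dtype=torch.float16,                        # INT8 is used for the ZeroQuant path
    replace_with_kernel_inject=True,            # swap in DeepSpeed's fused inference kernels
)
model = model.module  # unwrap the engine so the usual generate() API is available

inputs = tokenizer("DeepSpeed is", return_tensors="pt").to(torch.cuda.current_device())
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))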
Maintenance & Community
This repository has been archived and is no longer maintained; newer solutions such as vLLM and Text Generation Inference (TGI) are recommended instead. Issues should be opened in the respective backend repositories (Accelerate, DeepSpeed-Inference, DeepSpeed-ZeRO).
Licensing & Compatibility
The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The project is archived and no longer maintained. The provided scripts were tested only with BLOOM 176B on specific GPU configurations and may not work with other models or hardware setups. GPU memory may not be freed if a run crashes.