Lossless compression framework for efficient LLM GPU inference
DFloat11 is a lossless compression framework designed to reduce the size of Large Language Models (LLMs) by approximately 30%, enabling efficient GPU inference on resource-constrained hardware. It targets researchers and engineers working with LLMs who need to optimize memory usage and inference speed without compromising model accuracy.
How It Works
DFloat11 achieves lossless compression through a dynamic-length floating-point representation: weights are stored in a variable-length encoding that decompresses to the exact original BFloat16 bits, so model outputs are bit-for-bit identical to those of the uncompressed model. The framework integrates with the HuggingFace ecosystem, allowing adoption in existing LLM pipelines with little friction.
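A minimal sketch of the idea, assuming a Huffman code over BFloat16 exponent bits (the actual DFloat11 format and GPU kernel are more involved; the bit split, code construction, and toy weight tensor below are illustrative assumptions). It shows how a variable-length encoding can shrink BF16 weights while round-tripping to the identical bit pattern:

```python
# Illustrative only: a toy variable-length (Huffman) coder for BFloat16 weights.
# The split into exponent bits vs. sign+mantissa, the code construction, and the
# random weight tensor are assumptions for this sketch, not the DFloat11 format.
import heapq
from collections import Counter

import numpy as np


def build_huffman_code(freqs):
    """Return a prefix-free {symbol: bitstring} code from symbol frequencies."""
    heap = [(freq, i, {sym: ""}) for i, (sym, freq) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in left.items()}
        merged.update({s: "1" + b for s, b in right.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]


# Toy "weights": FP32 samples truncated to BFloat16 bit patterns (uint16).
rng = np.random.default_rng(0)
w32 = rng.normal(0.0, 0.02, size=4096).astype(np.float32).view(np.uint32)
w_bf16 = (w32 >> 16).astype(np.uint16)

# Split each word: 8 exponent bits (highly repetitive) vs. sign + 7 mantissa bits.
exponent = ((w_bf16 >> 7) & 0xFF).astype(np.uint16)
sign_mantissa = (((w_bf16 >> 8) & 0x80) | (w_bf16 & 0x7F)).astype(np.uint16)

code = build_huffman_code(Counter(exponent.tolist()))
bitstream = "".join(code[e] for e in exponent.tolist())

original_bits = 16 * w_bf16.size
compressed_bits = len(bitstream) + 8 * sign_mantissa.size  # coded exponents + raw byte
print(f"compressed size: {100.0 * compressed_bits / original_bits:.1f}% of original")

# Decode the prefix-free bitstream back to exact exponent bytes and reassemble
# the identical BFloat16 words: lossless by construction.
decoder = {bits: sym for sym, bits in code.items()}
decoded, buf = [], ""
for bit in bitstream:
    buf += bit
    if buf in decoder:
        decoded.append(decoder[buf])
        buf = ""
decoded = np.array(decoded, dtype=np.uint16)
restored = ((sign_mantissa & 0x80) << 8) | (decoded << 7) | (sign_mantissa & 0x7F)
assert np.array_equal(restored, w_bf16)  # bit-for-bit identical
```

Because decoding recovers the exact stored bits, outputs are unchanged by construction.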
Quick Start & Requirements
Install with pip install dfloat11[cuda12] (CUDA 12) or pip install dfloat11[cuda11] (CUDA 11). Run compressed models with the provided script (inference.py) or via HuggingFace from_pretrained with DFloat11ModelForCausalLM.
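A sketch of what the from_pretrained path might look like, assuming the import path dfloat11.DFloat11ModelForCausalLM, a hypothetical checkpoint name, and standard transformers-style generation; consult the repository's inference.py for the authoritative usage:

```python
# Hedged sketch of the from_pretrained loading path; the import path, checkpoint
# ID, keyword arguments, and generation call below are assumptions, not verified
# against the library -- see the repository's inference.py for actual usage.
import torch
from transformers import AutoTokenizer

from dfloat11 import DFloat11ModelForCausalLM  # class name per the README; path assumed

model_id = "DFloat11/Llama-3.1-8B-Instruct-DF11"  # hypothetical compressed checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = DFloat11ModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Explain lossless compression in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```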
Highlighted Details
Maintenance & Community
Developed by Rice University and xMAD.ai. GPU kernel designed by Tianyi Zhang.
Licensing & Compatibility
The README does not explicitly state the license. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
Requires a CUDA-compatible GPU. As noted above, the license and its implications for commercial use are not stated in the README.