Lossless compression framework for efficient LLM GPU inference
DFloat11 is a lossless compression framework designed to reduce the size of Large Language Models (LLMs) by approximately 30%, enabling efficient GPU inference on resource-constrained hardware. It targets researchers and engineers working with LLMs who need to optimize memory usage and inference speed without compromising model accuracy.
How It Works
DFloat11 achieves lossless compression through a dynamic-length floating-point representation: weights are stored in a variable-length encoding that decompresses to the exact original BFloat16 bits, so model outputs are bit-for-bit identical to those of the uncompressed model. The framework integrates with the HuggingFace ecosystem, allowing adoption in existing LLM pipelines with little friction.
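A minimal sketch of the idea, assuming a Huffman code over BFloat16 exponent bits (the actual DFloat11 format and GPU kernel are more involved; the bit split, code construction, and toy weight tensor below are illustrative assumptions). It shows how a variable-length encoding can shrink BF16 weights while round-tripping to the identical bit pattern:

```python
# Illustrative only: a toy variable-length (Huffman) coder for BFloat16 weights.
# The split into exponent bits vs. sign+mantissa, the code construction, and the
# random weight tensor are assumptions for this sketch, not the DFloat11 format.
import heapq
from collections import Counter

import numpy as np


def build_huffman_code(freqs):
    """Return a prefix-free {symbol: bitstring} code from symbol frequencies."""
    heap = [(freq, i, {sym: ""}) for i, (sym, freq) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in left.items()}
        merged.update({s: "1" + b for s, b in right.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]


# Toy "weights": FP32 samples truncated to BFloat16 bit patterns (uint16).
rng = np.random.default_rng(0)
w32 = rng.normal(0.0, 0.02, size=4096).astype(np.float32).view(np.uint32)
w_bf16 = (w32 >> 16).astype(np.uint16)

# Split each word: 8 exponent bits (highly repetitive) vs. sign + 7 mantissa bits.
exponent = ((w_bf16 >> 7) & 0xFF).astype(np.uint16)
sign_mantissa = (((w_bf16 >> 8) & 0x80) | (w_bf16 & 0x7F)).astype(np.uint16)

code = build_huffman_code(Counter(exponent.tolist()))
bitstream = "".join(code[e] for e in exponent.tolist())

original_bits = 16 * w_bf16.size
compressed_bits = len(bitstream) + 8 * sign_mantissa.size  # coded exponents + raw byte
print(f"compressed size: {100.0 * compressed_bits / original_bits:.1f}% of original")

# Decode the prefix-free bitstream back to exact exponent bytes and reassemble
# the identical BFloat16 words: lossless by construction.
decoder = {bits: sym for sym, bits in code.items()}
decoded, buf = [], ""
for bit in bitstream:
    buf += bit
    if buf in decoder:
        decoded.append(decoder[buf])
        buf = ""
decoded = np.array(decoded, dtype=np.uint16)
restored = ((sign_mantissa & 0x80) << 8) | (decoded << 7) | (sign_mantissa & 0x7F)
assert np.array_equal(restored, w_bf16)  # bit-for-bit identical
```

Because decoding recovers the exact stored bits, outputs are unchanged by construction.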
Quick Start & Requirements
Install with pip install dfloat11[cuda12] (CUDA 12) or pip install dfloat11[cuda11] (CUDA 11). Run compressed models with the provided script (inference.py) or via HuggingFace from_pretrained with DFloat11ModelForCausalLM.
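A sketch of what the from_pretrained path might look like, assuming the import path dfloat11.DFloat11ModelForCausalLM, a hypothetical checkpoint name, and standard transformers-style generation; consult the repository's inference.py for the authoritative usage:

```python
# Hedged sketch of the from_pretrained loading path; the import path, checkpoint
# ID, keyword arguments, and generation call below are assumptions, not verified
# against the library -- see the repository's inference.py for actual usage.
import torch
from transformers import AutoTokenizer

from dfloat11 import DFloat11ModelForCausalLM  # class name per the README; path assumed

model_id = "DFloat11/Llama-3.1-8B-Instruct-DF11"  # hypothetical compressed checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = DFloat11ModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Explain lossless compression in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```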
Highlighted Details
Maintenance & Community
Developed by Rice University and xMAD.ai. GPU kernel designed by Tianyi Zhang.
Licensing & Compatibility
The README does not explicitly state the license. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
Requires a CUDA-compatible GPU. As noted above, the license and its implications for commercial use are not stated in the README.