AngelSlim by Tencent

Model compression toolkit for efficient AI

Created 7 months ago
462 stars

Top 65.5% on SourcePulse

View on GitHub
Project Summary

AngelSlim is a model compression toolkit built for usability, comprehensiveness, and efficiency, aimed at engineers and researchers working with large AI models. It provides a unified framework for applying a range of compression techniques, making model deployment more accessible and performant, and streamlines the end-to-end compression workflow so advanced techniques are readily available.

How It Works

The toolkit integrates mainstream compression algorithms, including quantization (e.g., FP8, INT4, NVFP4, Tequila) and speculative decoding (Eagle3), into a unified, user-friendly framework. It focuses on performance optimization across the end-to-end compression workflow, from training to deployment. AngelSlim continuously researches and incorporates novel compression algorithms, offering a path to significantly reduce model size and inference costs while maintaining accuracy.
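To make the quantization idea concrete, here is a minimal, self-contained sketch of symmetric INT4 weight quantization in NumPy. This illustrates the general technique (mapping float weights to a signed 4-bit range with a per-tensor scale); it is not AngelSlim's API, and real INT4 schemes such as GPTQ or AWQ add per-group scales and error compensation on top of this.

```python
import numpy as np

def quantize_int4(w):
    """Symmetric per-tensor INT4: map floats to integers in [-8, 7]."""
    scale = np.abs(w).max() / 7.0  # use 7 so the max weight maps within range
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_int4(q, scale):
    """Recover approximate float weights from integers and the scale."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 8).astype(np.float32)
q, scale = quantize_int4(w)
w_hat = dequantize_int4(q, scale)
err = np.abs(w - w_hat).max()  # rounding error is at most half a step (scale / 2)
```

The 4-bit integers can then be packed two per byte, giving roughly an 8x size reduction over FP32 at the cost of the bounded rounding error above.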

Quick Start & Requirements

  • Primary install: pip install angelslim or clone and python setup.py install.
  • Prerequisites: a GPU is essential for acceptable performance; the documentation does not pin specific CUDA versions. A Python 3 environment is required.
  • Links: 📖 Documentation, 🤗 Hugging Face, 🤖 ModelScope, 💬 WeChat, 🫨 Discord.

Highlighted Details

  • Supports a broad spectrum of models including Large Language Models (LLMs), Vision Language Models (VLMs), Diffusion Models, and Speech Models from various providers like Tencent, Qwen, and DeepSeek.
  • Offers a comprehensive suite of compression techniques, featuring advanced quantization algorithms (FP8-Static/Dynamic, INT4-GPTQ/AWQ, NVFP4, Tequila) and speculative decoding (Eagle3) with early-exit mechanisms (SpecExit).
  • Demonstrates significant performance gains: Eagle3 speculative decoding achieves up to a 1.9x speedup with longer accepted draft lengths, while quantization methods such as FP8 and INT4 show minimal accuracy degradation.
  • Enables efficient deployment of large models, such as Qwen3-235B, on single GPUs through optimized quantization and speculative decoding frameworks.
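The speedup from speculative decoding comes from a cheap draft model proposing several tokens that the expensive target model then verifies in one pass. The toy sketch below shows the accept/verify loop for the greedy case; `draft_next` and `target_next` are stand-in functions invented for illustration, not AngelSlim or Eagle3 APIs, and production implementations verify all drafts in a single batched forward pass rather than one call per token.

```python
def draft_next(ctx):
    # Cheap stand-in draft model: guesses the next token as (last + 1) mod 10.
    return (ctx[-1] + 1) % 10

def target_next(ctx):
    # Stand-in target model: same rule, except it emits 0 after a 4.
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10

def speculative_step(ctx, k=4):
    """Draft k tokens, keep the longest prefix the target agrees with,
    then append one token from the target itself so each step progresses."""
    proposals, tmp = [], list(ctx)
    for _ in range(k):
        t = draft_next(tmp)
        proposals.append(t)
        tmp.append(t)
    accepted = []
    for t in proposals:
        if target_next(ctx + accepted) == t:
            accepted.append(t)  # draft matched the target: token accepted
        else:
            break               # first mismatch ends the accepted prefix
    accepted.append(target_next(ctx + accepted))  # target's own next token
    return ctx + accepted

seq = speculative_step([1], k=4)  # accepts 2, 3, 4, then the target emits 0
```

The "accept length" metrics reported for Eagle3 correspond to how long the accepted prefix is on average: the longer the draft model's guesses survive verification, the fewer expensive target passes are needed per generated token.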

Maintenance & Community

The project shows active development with frequent releases (e.g., v0.3, v0.2) and ongoing additions of new models and algorithms. Community engagement is facilitated through WeChat, Discord, and GitHub Issues for discussions and support.

Licensing & Compatibility

The code is released under a custom "License for AngelSlim." Its specific terms, including compatibility with commercial use and closed-source linking, are not detailed and warrant review before adoption.

Limitations & Caveats

Some advanced features, such as token pruning for VLMs and audio models, are listed as "Under Development." The absence of a clearly defined, standard open-source license may pose adoption challenges for certain use cases.

Health Check
Last Commit

23 hours ago

Responsiveness

Inactive

Pull Requests (30d)
10
Issues (30d)
4
Star History
164 stars in the last 30 days

Explore Similar Projects

Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 8 more.

EAGLE by SafeAILab

0.4%
2k
Speculative decoding research paper for faster LLM inference
Created 2 years ago
Updated 5 days ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Luis Capelo (Cofounder of Lightning AI), and 2 more.

Model-Optimizer by NVIDIA

2.2%
2k
Library for optimizing deep learning models for GPU inference
Created 1 year ago
Updated 17 hours ago
Starred by Yaowei Zheng (Author of LLaMA-Factory), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 7 more.

llm-awq by mit-han-lab

0.1%
3k
Weight quantization research paper for LLM compression/acceleration
Created 2 years ago
Updated 7 months ago
Starred by Junyang Lin (Core Maintainer at Alibaba Qwen), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 3 more.

neural-compressor by intel

0.1%
3k
Python library for model compression (quantization, pruning, distillation, NAS)
Created 5 years ago
Updated 17 hours ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Dan Guido (Cofounder of Trail of Bits), and 6 more.

llm-compressor by vllm-project

0.7%
3k
Transformers-compatible library for LLM compression, optimized for vLLM deployment
Created 1 year ago
Updated 19 hours ago