Chitu is a high-performance inference framework for large language models, designed for production-grade deployment with a focus on efficiency, flexibility, and scalability. It targets enterprises and researchers needing to deploy LLMs from small-scale experiments to large-scale clusters, offering optimized performance across diverse hardware and deployment scenarios.
How It Works
Chitu employs a highly optimized inference engine that supports various quantization techniques, including FP8 and FP4, to reduce memory footprint and increase throughput. It features advanced parallelism strategies (tensor parallelism and pipeline parallelism) and efficient operator implementations for both NVIDIA GPUs and Chinese domestic accelerators. The framework prioritizes long-term stability for production environments and offers features such as CPU+GPU heterogeneous inference.
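The way the parallelism strategies compose is easiest to see at launch time: the total GPU count equals the tensor-parallel degree times the pipeline-parallel degree. Below is a minimal launch sketch assuming a torchrun entry point with Hydra-style overrides; the script path and the option names (infer.tp_size, infer.pp_size, models.ckpt_dir) are illustrative assumptions, not confirmed by this summary, so consult the Development Manual for the actual CLI.

```bash
# Hypothetical launch sketch: 8 GPUs split as tensor-parallel x pipeline-parallel.
# Entry script and option names are assumptions; check the Development Manual
# for the real invocation.
torchrun --nproc_per_node 8 example/serve.py \
    models=DeepSeek-R1 \
    models.ckpt_dir=/data/DeepSeek-R1 \
    infer.tp_size=4 \
    infer.pp_size=2   # 4-way tensor parallelism per stage, 2 pipeline stages
```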
Quick Start & Requirements
- Install: Clone the repository with `--recursive`, install dependencies via `pip install -r requirements-build.txt`, install PyTorch with `pip install -U torch --index-url https://download.pytorch.org/whl/cu124` (replace `cu124` with your CUDA version), then build with `pip install --no-build-isolation .`; consolidated commands follow this list.
- Prerequisites: CUDA Toolkit (e.g., 12.4), PyTorch with matching CUDA support.
- Resources: Requires significant GPU memory for large models (e.g., 671B models need ~400GB for FP4).
- Docs: Development Manual
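Putting the steps together, a typical install sequence looks like the following; the repository URL is an assumption (the summary doesn't give one), and `cu124` should match your local CUDA toolkit.

```bash
# Consolidated install steps from the list above.
# Repository URL is assumed; substitute the actual upstream if it differs.
git clone --recursive https://github.com/thu-pacman/chitu.git
cd chitu
pip install -r requirements-build.txt
pip install -U torch --index-url https://download.pytorch.org/whl/cu124  # match your CUDA version
pip install --no-build-isolation .
```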
Highlighted Details
- Supports FP4 quantization with FP4->FP8/BF16 conversion, achieving high throughput and reduced memory usage (e.g., a 671B-parameter model fits in <400GB VRAM; at 4 bits per parameter the weights alone are roughly 336GB, leaving headroom for activations and KV cache).
- Offers CPU+GPU heterogeneous inference for memory-constrained scenarios.
- Demonstrates strong performance on large models such as DeepSeek-R1-671B, with published benchmarks reporting output throughput (tokens/s) across different batch sizes and hardware configurations.
- Provides a serving component with a RESTful API for easy integration; see the request sketch after this list.
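As a minimal integration sketch, the request below assumes the serving component exposes an OpenAI-compatible chat completions endpoint on a local port; the path `/v1/chat/completions`, port 8000, and model name are all assumptions not confirmed by this summary, so check the project docs for the actual route.

```bash
# Hypothetical request sketch: endpoint path, port, and model name are
# placeholders, not confirmed by this summary.
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "DeepSeek-R1",
          "messages": [{"role": "user", "content": "Hello, Chitu!"}],
          "max_tokens": 64
        }'
```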
Maintenance & Community
- Active development with recent releases (v0.3.0, v0.2.2).
- Welcomes contributions via a Contribution Guide.
- Community discussion via GitHub Issues and a WeChat group.
- Acknowledges support from China Telecom, Huawei, Mosix, and others.
Licensing & Compatibility
- Licensed under Apache License v2.0.
- Third-party submodules may have different licenses, detailed in their respective directories.
- Permits commercial use and linking with closed-source applications.
Limitations & Caveats
- FP4->FP8/BF16 operator implementations in v0.3.0 have room for performance optimization.
- CPU+GPU heterogeneous inference performance can be bottlenecked by CPU and main memory.
- The team cannot guarantee timely resolution for all user-reported issues due to resource constraints.