chitu by thu-pacman

High-performance LLM inference framework

created 5 months ago
1,184 stars

Top 33.6% on sourcepulse

Project Summary

Chitu is a high-performance inference framework for large language models, designed for production-grade deployment with a focus on efficiency, flexibility, and scalability. It targets enterprises and researchers needing to deploy LLMs from small-scale experiments to large-scale clusters, offering optimized performance across diverse hardware and deployment scenarios.

How It Works

Chitu employs a highly optimized inference engine that supports various quantization techniques, including FP8 and FP4, to reduce memory footprint and increase throughput. It features advanced parallelism strategies (Tensor Parallelism and Pipeline Parallelism) and efficient operator implementations for both NVIDIA GPUs and Chinese domestic accelerators. The framework prioritizes long-term stability for production environments and offers features such as CPU+GPU heterogeneous inference.
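To see why low-bit quantization matters at this scale, here is a rough weight-memory estimate (illustrative arithmetic only; a real deployment also needs memory for the KV cache, activations, and runtime overhead):

```python
def weight_memory_gib(n_params: float, bits_per_param: int) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3

N = 671e9  # parameter count of a 671B model such as DeepSeek-R1-671B

for name, bits in [("BF16", 16), ("FP8", 8), ("FP4", 4)]:
    print(f"{name}: {weight_memory_gib(N, bits):.0f} GiB")
# BF16 is roughly 1250 GiB, FP8 roughly 625 GiB, and FP4 roughly
# 312 GiB -- consistent with the "671B model in <400GB VRAM" figure
# quoted below, which also has to cover KV cache and overhead.
```

This back-of-the-envelope figure shows why FP4 is what brings a 671B model within reach of a single multi-GPU node.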

Quick Start & Requirements

  • Install: Clone the repository with --recursive, install build dependencies via pip install -r requirements-build.txt, install PyTorch via pip install -U torch --index-url https://download.pytorch.org/whl/cu124 (replace cu124 with your CUDA version), then run pip install --no-build-isolation . from the repository root.
  • Prerequisites: CUDA Toolkit (e.g., 12.4), PyTorch with matching CUDA support.
  • Resources: Requires significant GPU memory for large models (e.g., 671B models need ~400GB for FP4).
  • Docs: Development Manual
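The install steps above, collected into a script (this assumes the repository lives at github.com/thu-pacman/chitu; adjust cu124 to match your CUDA Toolkit version):

```shell
# Clone with submodules -- chitu requires --recursive
git clone --recursive https://github.com/thu-pacman/chitu.git
cd chitu

# Build dependencies, then PyTorch matching your CUDA version
pip install -r requirements-build.txt
pip install -U torch --index-url https://download.pytorch.org/whl/cu124

# Build and install chitu itself from the repository root
pip install --no-build-isolation .
```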

Highlighted Details

  • Supports FP4 quantization with FP4->FP8/BF16 conversion, achieving high throughput and reduced memory usage (e.g., 671B model fits in <400GB VRAM).
  • Offers CPU+GPU heterogeneous inference for memory-constrained scenarios.
  • Demonstrates strong performance on large models such as DeepSeek-R1-671B, with benchmarks reporting tokens/s throughput across different batch sizes and hardware configurations.
  • Provides a serving component with a RESTful API for easy integration.
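Since the serving component exposes a RESTful API, a minimal client could be sketched as follows. Note that the port, route, and request schema below are illustrative assumptions, not taken from chitu's documentation; consult the Development Manual for the actual interface:

```python
import json
import urllib.request

# Hypothetical server address -- chitu's actual default port may differ.
BASE_URL = "http://localhost:8000"

def build_request(prompt: str, max_tokens: int = 128) -> urllib.request.Request:
    """Build a POST request for a completion (assumed route and schema)."""
    payload = {"prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{BASE_URL}/v1/completions",  # assumed route, not confirmed
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_request("Hello, world")
# With a running server, urllib.request.urlopen(req) would send it
# and return the JSON completion response.
```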

Maintenance & Community

  • Active development with recent releases (v0.3.0, v0.2.2).
  • Welcomes contributions via a Contribution Guide.
  • Community discussion via GitHub Issues and a WeChat group.
  • Acknowledges support from China Telecom, Huawei, Mosix, and others.

Licensing & Compatibility

  • Licensed under Apache License v2.0.
  • Third-party submodules may have different licenses, detailed in their respective directories.
  • Permits commercial use and linking with closed-source applications.

Limitations & Caveats

  • FP4->FP8/BF16 operator implementations in v0.3.0 have room for performance optimization.
  • CPU+GPU heterogeneous inference performance can be bottlenecked by CPU and main memory.
  • The team cannot guarantee timely resolution for all user-reported issues due to resource constraints.
Health Check

  • Last commit: 1 day ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 5

Star History

  • 91 stars in the last 90 days
