chitu by thu-pacman

High-performance LLM inference framework

Created 7 months ago
1,278 stars

Top 31.1% on SourcePulse

Project Summary

Chitu is a high-performance inference framework for large language models, designed for production-grade deployment with a focus on efficiency, flexibility, and scalability. It targets enterprises and researchers needing to deploy LLMs from small-scale experiments to large-scale clusters, offering optimized performance across diverse hardware and deployment scenarios.

How It Works

Chitu employs a highly optimized inference engine that supports quantization techniques such as FP8 and FP4 to reduce memory footprint and increase throughput. It combines advanced parallelism strategies (tensor parallelism and pipeline parallelism) with efficient operator implementations for NVIDIA GPUs and Chinese domestic accelerators. The framework prioritizes long-term stability for production environments and offers features such as CPU+GPU heterogeneous inference.
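
To make the quantization idea concrete, here is a minimal, illustrative sketch (not Chitu's internal code) of per-tensor FP8 weight storage using PyTorch's float8_e4m3fn dtype, with dequantization back to BF16 for compute:

    # Illustrative only -- not Chitu's kernels. FP8 (e4m3) storage keeps one
    # byte per weight, halving memory versus BF16.
    import torch

    def quantize_fp8(w: torch.Tensor):
        # Scale so the largest magnitude maps near e4m3's max normal (~448).
        scale = w.abs().max().clamp(min=1e-12) / 448.0
        return (w / scale).to(torch.float8_e4m3fn), scale

    def dequantize_fp8(w_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
        return w_fp8.to(torch.bfloat16) * scale

    w = torch.randn(4096, 4096, dtype=torch.bfloat16)
    w_fp8, scale = quantize_fp8(w)
    print(w.element_size(), "->", w_fp8.element_size())  # 2 bytes -> 1 byte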

Quick Start & Requirements

  • Install: Clone the repository with --recursive, then run pip install -r requirements-build.txt, install PyTorch with pip install -U torch --index-url https://download.pytorch.org/whl/cu124 (replace cu124 with your CUDA version), and finally run pip install --no-build-isolation . from the repository root.
  • Prerequisites: CUDA Toolkit (e.g., 12.4), PyTorch with matching CUDA support.
  • Resources: Requires significant GPU memory for large models (e.g., a 671B-parameter model needs ~400GB even at FP4).
  • Docs: Development Manual
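
The ~400GB figure is easy to sanity-check with back-of-envelope arithmetic (the split between weights and runtime overhead below is our assumption, not a published breakdown):

    # FP4 stores 4 bits = 0.5 bytes per weight, so for 671B parameters:
    params = 671e9
    weights_gb = params * 0.5 / 1e9   # ~336 GB for the weights alone
    print(f"~{weights_gb:.0f} GB")
    # KV cache, activations, and runtime buffers consume the remaining
    # headroom within the ~400 GB figure quoted above.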

Highlighted Details

  • Supports FP4 quantization with FP4->FP8/BF16 conversion, achieving high throughput with reduced memory usage (e.g., a 671B model fits in under 400GB of VRAM); a rough sketch of the conversion idea follows this list.
  • Offers CPU+GPU heterogeneous inference for memory-constrained scenarios.
  • Demonstrates strong performance on large models such as DeepSeek-R1-671B, with benchmarks reporting tokens-per-second throughput across a range of batch sizes and hardware configurations.
  • Provides a serving component with a RESTful API for easy integration; a client sketch follows below.
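
As a rough illustration of the FP4->BF16 direction mentioned above, the sketch below unpacks two 4-bit codes per byte and rescales them; it uses a plain 4-bit integer code as a stand-in for the actual FP4 format, which Chitu's fused GPU kernels handle far more efficiently:

    # Illustrative only: int4 stands in for true FP4 here.
    import torch

    def unpack_4bit_to_bf16(packed: torch.Tensor, scale: torch.Tensor):
        lo = (packed & 0x0F).to(torch.int8) - 8   # low nibble, recentered
        hi = (packed >> 4).to(torch.int8) - 8     # high nibble, recentered
        codes = torch.stack((lo, hi), dim=-1).flatten(-2)
        return codes.to(torch.bfloat16) * scale.to(torch.bfloat16)

    packed = torch.randint(0, 256, (1024,), dtype=torch.uint8)
    out = unpack_4bit_to_bf16(packed, torch.tensor(0.01))
    print(out.shape)  # torch.Size([2048])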

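The serving API can be exercised from any HTTP client. The route, port, and payload below are assumptions for illustration (an OpenAI-style chat endpoint), not Chitu's documented schema; consult the Development Manual for the real one:

    # Hypothetical client sketch -- endpoint path, port, and payload shape
    # are assumed, not taken from Chitu's docs.
    import requests

    resp = requests.post(
        "http://localhost:8000/v1/chat/completions",  # host/port assumed
        json={
            "model": "deepseek-r1",  # placeholder model name
            "messages": [{"role": "user", "content": "Hello!"}],
            "max_tokens": 64,
        },
        timeout=60,
    )
    resp.raise_for_status()
    print(resp.json())
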
Maintenance & Community

  • Active development with recent releases (v0.3.0, v0.2.2).
  • Welcomes contributions via a Contribution Guide.
  • Community discussion via GitHub Issues and a WeChat group.
  • Acknowledges support from China Telecom, Huawei, Mosix, and others.

Licensing & Compatibility

  • Licensed under Apache License v2.0.
  • Third-party submodules may have different licenses, detailed in their respective directories.
  • Permits commercial use and linking with closed-source applications.

Limitations & Caveats

  • FP4->FP8/BF16 operator implementations in v0.3.0 have room for performance optimization.
  • CPU+GPU heterogeneous inference performance can be bottlenecked by CPU and main memory.
  • The team cannot guarantee timely resolution for all user-reported issues due to resource constraints.

Health Check

  • Last commit: 23 hours ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 6

Star History

  • 39 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Ying Sheng (coauthor of SGLang).

  • fastllm by ztxz16: high-performance C++ LLM inference library (4k stars, top 0.4%; created 2 years ago, updated 1 week ago)