Chitu is a high-performance inference framework for large language models, designed for production-grade deployment with a focus on efficiency, flexibility, and scalability. It targets enterprises and researchers needing to deploy LLMs from small-scale experiments to large-scale clusters, offering optimized performance across diverse hardware and deployment scenarios.
How It Works
Chitu employs a highly optimized inference engine that supports various quantization techniques, including FP8 and FP4, to reduce memory footprint and increase throughput. It features advanced parallelism strategies (tensor parallelism and pipeline parallelism) and efficient operator implementations for both NVIDIA GPUs and Chinese domestic accelerators. The framework prioritizes long-term stability for production environments and offers features such as CPU+GPU heterogeneous inference.
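The way the parallelism strategies compose is easiest to see at launch time: the total GPU count equals the tensor-parallel degree times the pipeline-parallel degree. Below is a minimal launch sketch assuming a torchrun entry point with Hydra-style overrides; the script path and the option names (infer.tp_size, infer.pp_size, models.ckpt_dir) are illustrative assumptions, not confirmed by this summary, so consult the Development Manual for the actual CLI.

```bash
# Hypothetical launch sketch: 8 GPUs split as tensor-parallel x pipeline-parallel.
# Entry script and option names are assumptions; check the Development Manual
# for the real invocation.
torchrun --nproc_per_node 8 example/serve.py \
    models=DeepSeek-R1 \
    models.ckpt_dir=/data/DeepSeek-R1 \
    infer.tp_size=4 \
    infer.pp_size=2   # 4-way tensor parallelism per stage, 2 pipeline stages
```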
Quick Start & Requirements
- Install: Clone the repository with `--recursive`, install dependencies via `pip install -r requirements-build.txt`, install PyTorch with `pip install -U torch --index-url https://download.pytorch.org/whl/cu124` (replace `cu124` with your CUDA version), then build with `pip install --no-build-isolation .`; consolidated commands follow this list.
- Prerequisites: CUDA Toolkit (e.g., 12.4), PyTorch with matching CUDA support.
- Resources: Requires significant GPU memory for large models (e.g., 671B models need ~400GB for FP4).
- Docs: Development Manual
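Putting the steps together, a typical install sequence looks like the following; the repository URL is an assumption (the summary doesn't give one), and `cu124` should match your local CUDA toolkit.

```bash
# Consolidated install steps from the list above.
# Repository URL is assumed; substitute the actual upstream if it differs.
git clone --recursive https://github.com/thu-pacman/chitu.git
cd chitu
pip install -r requirements-build.txt
pip install -U torch --index-url https://download.pytorch.org/whl/cu124  # match your CUDA version
pip install --no-build-isolation .
```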
Highlighted Details
- Supports FP4 quantization with FP4->FP8/BF16 conversion, achieving high throughput and reduced memory usage (e.g., a 671B-parameter model fits in <400GB VRAM; at 4 bits per parameter the weights alone are roughly 336GB, leaving headroom for activations and KV cache).
- Offers CPU+GPU heterogeneous inference for memory-constrained scenarios.
- Demonstrates strong performance on large models such as DeepSeek-R1-671B, with published benchmarks reporting output throughput (tokens/s) across different batch sizes and hardware configurations.
- Provides a serving component with a RESTful API for easy integration; see the request sketch after this list.
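As a minimal integration sketch, the request below assumes the serving component exposes an OpenAI-compatible chat completions endpoint on a local port; the path `/v1/chat/completions`, port 8000, and model name are all assumptions not confirmed by this summary, so check the project docs for the actual route.

```bash
# Hypothetical request sketch: endpoint path, port, and model name are
# placeholders, not confirmed by this summary.
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "DeepSeek-R1",
          "messages": [{"role": "user", "content": "Hello, Chitu!"}],
          "max_tokens": 64
        }'
```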
Maintenance & Community
- Active development with recent releases (v0.3.0, v0.2.2).
- Welcomes contributions via a Contribution Guide.
- Community discussion via GitHub Issues and a WeChat group.
- Acknowledges support from China Telecom, Huawei, Mosix, and others.
Licensing & Compatibility
- Licensed under Apache License v2.0.
- Third-party submodules may have different licenses, detailed in their respective directories.
- Permits commercial use and linking with closed-source applications.
Limitations & Caveats
- FP4->FP8/BF16 operator implementations in v0.3.0 have room for performance optimization.
- CPU+GPU heterogeneous inference performance can be bottlenecked by CPU and main memory.
- The team cannot guarantee timely resolution for all user-reported issues due to resource constraints.