NanoFlow by efeslab

LLM serving framework for high throughput

Created 1 year ago
891 stars

Top 40.7% on SourcePulse

Project Summary

NanoFlow is a high-performance LLM serving framework designed to maximize throughput by exploiting intra-device parallelism. It targets researchers and engineers needing to serve LLMs efficiently, offering significant throughput gains over existing solutions like vLLM and TensorRT-LLM.

How It Works

NanoFlow introduces "nano-batching" to split requests at the operation granularity, enabling the overlapping of compute-, memory-, and network-bound operations within a single GPU. This intra-device parallelism, managed by a device-level pipeline with execution unit scheduling, aims to keep all hardware resources busy. Asynchronous CPU scheduling further optimizes performance by overlapping KV-cache management and batch formation with GPU execution.
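The overlap described above can be illustrated with a small scheduling sketch. This is illustrative Python only: the op names, resource tags, and greedy scheduler are assumptions made for the example, not NanoFlow's actual implementation.

```python
# Hypothetical sketch of nano-batching: a request batch is split into
# smaller "nano-batches" that each walk the same operation sequence.
# Ops are tagged by the resource they stress, so ops bound by different
# resources (compute / memory / network) can overlap on one GPU.
OP_SEQUENCE = [
    ("qkv_gemm", "compute"),
    ("attention", "memory"),   # KV-cache reads dominate
    ("allreduce", "network"),  # tensor-parallel sync
    ("ffn_gemm", "compute"),
]

def schedule(num_nano_batches):
    """Greedy tick-by-tick scheduler: each tick runs at most one op per
    resource class, drawn from the nano-batches' next pending ops."""
    progress = [0] * num_nano_batches  # next-op index per nano-batch
    timeline = []
    while any(p < len(OP_SEQUENCE) for p in progress):
        busy = set()
        step = []
        for nb in range(num_nano_batches):
            if progress[nb] >= len(OP_SEQUENCE):
                continue
            op, resource = OP_SEQUENCE[progress[nb]]
            if resource not in busy:  # that resource is free this tick
                busy.add(resource)
                step.append((nb, op))
                progress[nb] += 1
        timeline.append(step)
    return timeline

ticks = schedule(num_nano_batches=2)
print(len(ticks))  # 5 ticks instead of 8 serial ops
```

With two nano-batches, the eight total ops complete in five ticks rather than eight, because while one nano-batch occupies the compute units another can use memory bandwidth or the interconnect; this is the intuition behind the intra-device pipeline, in toy form.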

Quick Start & Requirements

  • Installation: Docker is recommended. A setup script setup.sh is provided.
  • Prerequisites: NVIDIA GPUs, CUDA, NVHPC container (v23.11-devel-cuda_multi-ubuntu22.04 recommended), Anaconda.
  • Serving: ./serve.sh
  • Evaluation: ./perf.sh
  • Resources: Requires significant GPU memory for larger models.
  • Links: Paper, Slides

Highlighted Details

  • Achieves up to 1.91x higher throughput than TensorRT-LLM.
  • Supports models like Llama2/3 70B, Llama3.1 70B, and Qwen2 72B.
  • Integrates CUTLASS, FlashInfer, and MSCCL++.
  • Eagerly offloads KV-cache to SSDs for efficient multi-round conversation handling.

Maintenance & Community

The project is actively developed with recent updates supporting new models. It reuses code from and is inspired by projects like CUTLASS, FlashInfer, MSCCL++, and Punica.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Compatibility for commercial or closed-source use is not specified.

Limitations & Caveats

The framework is primarily C++-based with a Python frontend, and the README does not detail specific limitations or unsupported platforms. The lack of an explicit license may pose a barrier to commercial adoption.

Health Check

  • Last Commit: 1 day ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 5
  • Star History: 21 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Johannes Hagemann (cofounder of Prime Intellect), and 4 more.

S-LoRA by S-LoRA

Top 0.2% on SourcePulse
2k stars
System for scalable LoRA adapter serving
Created 1 year ago
Updated 1 year ago
Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Ying Sheng (coauthor of SGLang).

fastllm by ztxz16

Top 0.4% on SourcePulse
4k stars
High-performance C++ LLM inference library
Created 2 years ago
Updated 1 week ago