NanoFlow by efeslab

LLM serving framework for high throughput

Created 1 year ago
946 stars

Top 38.6% on SourcePulse

View on GitHub
Project Summary

NanoFlow is a high-performance LLM serving framework designed to maximize throughput by exploiting intra-device parallelism. It targets researchers and engineers needing to serve LLMs efficiently, offering significant throughput gains over existing solutions like vLLM and TensorRT-LLM.

How It Works

NanoFlow introduces "nano-batching" to split requests at the operation granularity, enabling the overlapping of compute-, memory-, and network-bound operations within a single GPU. This intra-device parallelism, managed by a device-level pipeline with execution unit scheduling, aims to keep all hardware resources busy. Asynchronous CPU scheduling further optimizes performance by overlapping KV-cache management and batch formation with GPU execution.
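As a rough illustration of the idea, the toy Python sketch below staggers nano-batches through a fixed pipeline of operation stages so that, once the pipeline fills, compute-, memory-, and network-bound operations are in flight in the same step. This is a hypothetical model, not NanoFlow's actual C++ scheduler; the NanoBatch type, operation names, and stage order are invented for this example.

```python
from dataclasses import dataclass, field

@dataclass
class NanoBatch:
    requests: list
    # Per-layer operations, tagged by the hardware resource they stress.
    ops: list = field(default_factory=lambda: [
        ("gemm", "compute"), ("attention", "memory"), ("allreduce", "network"),
    ])

def schedule(nano_batches):
    """Stagger each nano-batch by one stage; once the pipeline is full,
    a single step carries compute-, memory-, and network-bound work at once."""
    n_stages = len(nano_batches[0].ops)
    total_steps = n_stages + len(nano_batches) - 1
    timeline = []
    for step in range(total_steps):
        slot = []
        for i, nb in enumerate(nano_batches):
            stage = step - i  # nano-batch i starts i steps late
            if 0 <= stage < n_stages:
                op, resource = nb.ops[stage]
                slot.append((i, op, resource))
        timeline.append(slot)
    return timeline

nano_batches = [NanoBatch(requests=[r]) for r in range(3)]
timeline = schedule(nano_batches)
# In the steady-state step, three distinct resources are busy simultaneously.
print({resource for _, _, resource in timeline[2]})
```

The staggering is what distinguishes this from ordinary batching: with a single large batch, each operation type would run alone and leave the other resources idle.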

Quick Start & Requirements

  • Installation: Docker is recommended; a setup script (setup.sh) is provided.
  • Prerequisites: NVIDIA GPUs, CUDA, NVHPC container (v23.11-devel-cuda_multi-ubuntu22.04 recommended), Anaconda.
  • Serving: ./serve.sh
  • Evaluation: ./perf.sh
  • Resources: Requires significant GPU memory for larger models.
  • Links: Paper, Slides
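Taken together, the steps above suggest an end-to-end flow along the following lines. This is a hedged sketch, not the repository's authoritative instructions: the NGC image tag mirrors the recommended NVHPC container version, the repository path is inferred from the project name, and the scripts may take arguments not shown here.

```shell
# Hypothetical quick-start sketch; consult the repository README for the
# authoritative commands.

# 1. Launch the recommended NVHPC container with GPU access.
docker run --gpus all -it --rm \
    nvcr.io/nvidia/nvhpc:23.11-devel-cuda_multi-ubuntu22.04

# 2. Inside the container: clone the repository and run the setup script.
git clone https://github.com/efeslab/Nanoflow.git && cd Nanoflow
./setup.sh

# 3. Start serving, or run the throughput evaluation.
./serve.sh
./perf.sh
```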

Highlighted Details

  • Achieves up to 1.91x higher throughput than TensorRT-LLM.
  • Supports models like Llama2/3 70B, Llama3.1 70B, and Qwen2 72B.
  • Integrates CUTLASS, FlashInfer, and MSCCL++.
  • Eagerly offloads KV-cache to SSDs for efficient multi-round conversation handling.

Maintenance & Community

The project is actively developed, with recent updates adding support for new models. It reuses code from, and is inspired by, projects such as CUTLASS, FlashInfer, MSCCL++, and Punica.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Compatibility for commercial or closed-source use is not specified.

Limitations & Caveats

The framework is primarily C++-based with a Python frontend, and the README does not detail specific limitations or unsupported platforms. The absence of an explicit license may pose a barrier to commercial adoption.

Health Check

  • Last Commit: 3 months ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 6 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Johannes Hagemann (cofounder of Prime Intellect), and 4 more.

S-LoRA by S-LoRA

Top 0% on SourcePulse · 2k stars
System for scalable LoRA adapter serving
Created 2 years ago · Updated 2 years ago
Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Ying Sheng (coauthor of SGLang).

fastllm by ztxz16

Top 0.1% on SourcePulse · 4k stars
High-performance C++ LLM inference library
Created 2 years ago · Updated 17 hours ago