NanoFlow by efeslab

LLM serving framework for high throughput

Created 1 year ago
891 stars

Top 40.7% on SourcePulse

Project Summary

NanoFlow is a high-performance LLM serving framework designed to maximize throughput by exploiting intra-device parallelism. It targets researchers and engineers needing to serve LLMs efficiently, offering significant throughput gains over existing solutions like vLLM and TensorRT-LLM.

How It Works

NanoFlow introduces "nano-batching" to split requests at the operation granularity, enabling the overlapping of compute-, memory-, and network-bound operations within a single GPU. This intra-device parallelism, managed by a device-level pipeline with execution unit scheduling, aims to keep all hardware resources busy. Asynchronous CPU scheduling further optimizes performance by overlapping KV-cache management and batch formation with GPU execution.
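The overlap described above can be illustrated with a small scheduling sketch. This is illustrative Python only: the op names, resource tags, and greedy scheduler are assumptions made for the example, not NanoFlow's actual implementation.

```python
# Hypothetical sketch of nano-batching: a request batch is split into
# smaller "nano-batches" that each walk the same operation sequence.
# Ops are tagged by the resource they stress, so ops bound by different
# resources (compute / memory / network) can overlap on one GPU.
OP_SEQUENCE = [
    ("qkv_gemm", "compute"),
    ("attention", "memory"),   # KV-cache reads dominate
    ("allreduce", "network"),  # tensor-parallel sync
    ("ffn_gemm", "compute"),
]

def schedule(num_nano_batches):
    """Greedy tick-by-tick scheduler: each tick runs at most one op per
    resource class, drawn from the nano-batches' next pending ops."""
    progress = [0] * num_nano_batches  # next-op index per nano-batch
    timeline = []
    while any(p < len(OP_SEQUENCE) for p in progress):
        busy = set()
        step = []
        for nb in range(num_nano_batches):
            if progress[nb] >= len(OP_SEQUENCE):
                continue
            op, resource = OP_SEQUENCE[progress[nb]]
            if resource not in busy:  # that resource is free this tick
                busy.add(resource)
                step.append((nb, op))
                progress[nb] += 1
        timeline.append(step)
    return timeline

ticks = schedule(num_nano_batches=2)
print(len(ticks))  # 5 ticks instead of 8 serial ops
```

With two nano-batches, the eight total ops complete in five ticks rather than eight, because while one nano-batch occupies the compute units another can use memory bandwidth or the interconnect; this is the intuition behind the intra-device pipeline, in toy form.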

Quick Start & Requirements

  • Installation: Docker is recommended. A setup script setup.sh is provided.
  • Prerequisites: NVIDIA GPUs, CUDA, NVHPC container (v23.11-devel-cuda_multi-ubuntu22.04 recommended), Anaconda.
  • Serving: ./serve.sh
  • Evaluation: ./perf.sh
  • Resources: Requires significant GPU memory for larger models.
  • Links: Paper, Slides

Highlighted Details

  • Achieves up to 1.91x higher throughput than TensorRT-LLM.
  • Supports models like Llama2/3 70B, Llama3.1 70B, and Qwen2 72B.
  • Integrates CUTLASS, FlashInfer, and MSCCL++.
  • Eagerly offloads KV-cache to SSDs for efficient multi-round conversation handling.

Maintenance & Community

The project is actively developed with recent updates supporting new models. It reuses code from and is inspired by projects like CUTLASS, FlashInfer, MSCCL++, and Punica.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Compatibility for commercial or closed-source use is not specified.

Limitations & Caveats

The framework is primarily C++-based with a Python frontend, and the README does not detail specific limitations or unsupported platforms. The lack of an explicit license may pose a barrier to commercial adoption.

Health Check

  • Last Commit: 1 day ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 5
  • Star History: 21 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Johannes Hagemann (cofounder of Prime Intellect), and 4 more.

S-LoRA by S-LoRA

Top 0.2% on SourcePulse
2k stars
System for scalable LoRA adapter serving
Created 1 year ago
Updated 1 year ago
Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Ying Sheng (coauthor of SGLang).

fastllm by ztxz16

Top 0.4% on SourcePulse
4k stars
High-performance C++ LLM inference library
Created 2 years ago
Updated 1 week ago