Nanoflow by efeslab

LLM serving framework for high throughput

created 11 months ago
856 stars

Top 42.7% on sourcepulse

Project Summary

NanoFlow is a high-performance LLM serving framework designed to maximize throughput by exploiting intra-device parallelism. It targets researchers and engineers needing to serve LLMs efficiently, offering significant throughput gains over existing solutions like vLLM and TensorRT-LLM.

How It Works

NanoFlow introduces "nano-batching" to split requests at the operation granularity, enabling the overlapping of compute-, memory-, and network-bound operations within a single GPU. This intra-device parallelism, managed by a device-level pipeline with execution unit scheduling, aims to keep all hardware resources busy. Asynchronous CPU scheduling further optimizes performance by overlapping KV-cache management and batch formation with GPU execution.
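As a rough illustration of the idea above, the following sketch splits one batch into nano-batches and prints which operation type could run on which nano-batch at each time step. It is a toy model only: the stage names, `split_into_nano_batches`, and `pipeline_schedule` are hypothetical names introduced here, and the real system schedules GPU execution units rather than stepping through a simple software pipeline.

```python
# Toy model of nano-batching (not NanoFlow's actual code or API).
# Assumption: every operation type takes one time step per nano-batch.
STAGES = ("compute", "memory", "network")  # hypothetical stage names


def split_into_nano_batches(requests, num_nano_batches):
    """Split a batch into contiguous, roughly equal nano-batches."""
    size, extra = divmod(len(requests), num_nano_batches)
    nano_batches, start = [], 0
    for i in range(num_nano_batches):
        end = start + size + (1 if i < extra else 0)
        if end > start:  # skip empty slices when the batch is small
            nano_batches.append(requests[start:end])
        start = end
    return nano_batches


def pipeline_schedule(num_nano_batches, stages=STAGES):
    """For each time step, list the (stage, nano_batch_index) pairs that overlap."""
    steps = []
    for t in range(num_nano_batches + len(stages) - 1):
        active = [
            (stage, t - s)
            for s, stage in enumerate(stages)
            if 0 <= t - s < num_nano_batches
        ]
        steps.append(active)
    return steps


if __name__ == "__main__":
    batch = list(range(10))
    nano = split_into_nano_batches(batch, 4)
    for t, active in enumerate(pipeline_schedule(len(nano))):
        print(t, active)
```

In this model, four nano-batches passing through three stages finish in 6 steps, whereas running every stage of every nano-batch one after another would take 12. The gain comes from different resource types being busy at the same time, which is the effect the overlapping described above aims for.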

Quick Start & Requirements

  • Installation: Docker is recommended. A setup script setup.sh is provided.
  • Prerequisites: NVIDIA GPUs, CUDA, NVHPC container (v23.11-devel-cuda_multi-ubuntu22.04 recommended), Anaconda.
  • Serving: ./serve.sh
  • Evaluation: ./perf.sh
  • Resources: Requires significant GPU memory for larger models.
  • Links: Paper, Slides

Highlighted Details

  • Achieves up to 1.91x higher throughput than TensorRT-LLM.
  • Supports models like Llama2/3 70B, Llama3.1 70B, and Qwen2 72B.
  • Integrates CUTLASS, FlashInfer, and MSCCL++.
  • Eagerly offloads KV-cache to SSDs for efficient multi-round conversation handling.
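The eager KV-cache offloading mentioned in the last item can be pictured with a minimal two-tier cache. This is a sketch under stated assumptions, not the project's implementation: `OffloadingKVCache` and its methods are hypothetical, a Python dict stands in for GPU memory, a directory stands in for the SSD, and conversation IDs are assumed to be safe to use as file names.

```python
import os
import pickle


class OffloadingKVCache:
    """Toy two-tier KV-cache: a dict models GPU memory, a directory models the SSD."""

    def __init__(self, offload_dir):
        self.offload_dir = offload_dir
        self.in_memory = {}

    def _path(self, conversation_id):
        return os.path.join(self.offload_dir, f"{conversation_id}.kv")

    def put(self, conversation_id, kv_entries):
        """Store entries in memory and eagerly write a copy to disk."""
        self.in_memory[conversation_id] = kv_entries
        with open(self._path(conversation_id), "wb") as f:
            pickle.dump(kv_entries, f)

    def evict(self, conversation_id):
        """Free the in-memory copy; the on-disk copy already exists."""
        self.in_memory.pop(conversation_id, None)

    def get(self, conversation_id):
        """Return cached entries, reloading from disk after an eviction."""
        if conversation_id in self.in_memory:
            return self.in_memory[conversation_id]
        path = self._path(conversation_id)
        if not os.path.exists(path):
            return None  # never cached: the caller would recompute
        with open(path, "rb") as f:
            kv_entries = pickle.load(f)
        self.in_memory[conversation_id] = kv_entries
        return kv_entries
```

Because the copy is written at `put` time, eviction never has to wait on a disk write, and a later round of the same conversation can reload its entries instead of recomputing them.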

Maintenance & Community

The project is actively developed with recent updates supporting new models. It reuses code from and is inspired by projects like CUTLASS, FlashInfer, MSCCL++, and Punica.

Licensing & Compatibility

The README does not state a license, so the terms for commercial or closed-source use are unspecified.

Limitations & Caveats

The framework is primarily C++ with a Python frontend. The README does not list specific limitations or unsupported platforms, and the lack of an explicit license may be a barrier to commercial adoption.

Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 2
  • Star History: 57 stars in the last 90 days
