LLM serving framework for high throughput
NanoFlow is a high-performance LLM serving framework designed to maximize throughput by exploiting intra-device parallelism. It targets researchers and engineers needing to serve LLMs efficiently, offering significant throughput gains over existing solutions like vLLM and TensorRT-LLM.
How It Works
NanoFlow introduces "nano-batching" to split requests at the operation granularity, enabling the overlapping of compute-, memory-, and network-bound operations within a single GPU. This intra-device parallelism, managed by a device-level pipeline with execution unit scheduling, aims to keep all hardware resources busy. Asynchronous CPU scheduling further optimizes performance by overlapping KV-cache management and batch formation with GPU execution.
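As a rough illustration of the idea (not NanoFlow's actual API), the Python sketch below pipelines nano-batches through three single-worker executors that stand in for compute-, memory-, and network-bound execution units; the stage names, durations, and the run_pipelined helper are assumptions made up for the example.

```python
# Illustrative sketch of nano-batching: a request batch is split into
# nano-batches that flow through compute-, memory-, and network-bound
# stages concurrently, so no single resource sits idle.
import time
from concurrent.futures import ThreadPoolExecutor


def compute_bound(nb):
    time.sleep(0.03)   # stand-in for compute-bound GEMMs
    return nb


def memory_bound(nb):
    time.sleep(0.02)   # stand-in for memory-bound KV-cache reads
    return nb


def network_bound(nb):
    time.sleep(0.01)   # stand-in for a network-bound all-reduce
    return nb


def run_pipelined(nano_batches):
    # One single-worker executor per resource stands in for a dedicated
    # execution unit; nano-batch i runs its memory stage while nano-batch
    # i+1 is still computing, keeping all three "units" busy at once.
    with ThreadPoolExecutor(1) as compute_eu, \
         ThreadPoolExecutor(1) as memory_eu, \
         ThreadPoolExecutor(1) as network_eu:
        futures = []
        for nb in nano_batches:
            f = compute_eu.submit(compute_bound, nb)
            f = memory_eu.submit(lambda f=f: memory_bound(f.result()))
            f = network_eu.submit(lambda f=f: network_bound(f.result()))
            futures.append(f)
        return [f.result() for f in futures]


if __name__ == "__main__":
    start = time.time()
    run_pipelined(range(4))   # split one iteration batch into 4 nano-batches
    print(f"pipelined nano-batches done in {time.time() - start:.2f}s")
```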
Quick Start & Requirements
A setup script, setup.sh, is provided. Launch serving with ./serve.sh and run the performance benchmark with ./perf.sh.
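A hypothetical Python wrapper around these scripts is sketched below; only the script names come from the README, so the invocation details and ordering are assumptions.

```python
# Hypothetical convenience wrapper for the quick-start scripts; only the
# script names appear in the README, everything else here is assumed.
import subprocess


def run_script(name: str) -> None:
    # check=True stops the flow if a script exits with a non-zero status.
    subprocess.run(["bash", name], check=True)


if __name__ == "__main__":
    run_script("setup.sh")    # environment / dependency setup
    run_script("serve.sh")    # start serving (assumed to block while serving)
    # run_script("perf.sh")   # benchmark; presumably run from a separate shell
```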
Highlighted Details
Maintenance & Community
The project is actively developed, with recent updates adding support for new models. It reuses code from, and is inspired by, projects such as CUTLASS, FlashInfer, MSCCL++, and Punica.
Licensing & Compatibility
The repository does not explicitly state a license in the README. Compatibility for commercial or closed-source use is not specified.
Limitations & Caveats
The framework is primarily C++-based with a Python frontend, and the README does not detail specific limitations or unsupported platforms. The absence of an explicit license may be a barrier to commercial adoption.