flex-nano-vllm by changjonathanc

Fast Gemma 2 inference engine

Created 1 month ago
275 stars

Top 94.1% on SourcePulse

View on GitHub
Project Summary

This project provides a minimal, vLLM-style inference engine for fast Gemma 2 inference, built on FlexAttention with no custom Triton kernels and no FlashAttention dependency. It targets users who want efficient, straightforward LLM inference for Gemma 2 models.
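The repository's exact public API is not shown here; purely as an illustration, an engine inspired by nano-vllm might be driven along the following lines. The module, class, and argument names in this sketch are hypothetical, not taken from flex-nano-vllm.

```python
# Hypothetical usage sketch -- module, class, and argument names are assumed,
# not taken from flex-nano-vllm; the real interface may differ.
from flex_nano_vllm import LLM, SamplingParams  # hypothetical import path

llm = LLM("google/gemma-2-2b-it")                         # load a Gemma 2 checkpoint
params = SamplingParams(temperature=0.7, max_tokens=256)  # hypothetical knobs
outputs = llm.generate(
    ["Summarize paged attention in two sentences."],
    params,
)
print(outputs[0])
```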

How It Works

The engine implements paged attention by adapting the reference implementation from pytorch-labs/attention-gym. The Gemma 2 model code is copied from Hugging Face's transformers library and modified to integrate with FlexAttention and the paged attention mechanism. The codebase favors a flat structure and inline comments for clarity and ease of understanding.
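For orientation, the sketch below shows the FlexAttention primitives this design builds on: a mask_mod defines which query/key pairs may attend, create_block_mask compiles it into a block mask, and flex_attention fuses everything into one generated kernel. This is a minimal causal example, not the project's paged-attention code; a paged variant additionally routes kv_idx through a page table that maps logical token positions to physical KV-cache slots.

```python
# Minimal FlexAttention sketch (illustrative only; not flex-nano-vllm's code).
import torch
from torch.nn.attention.flex_attention import create_block_mask, flex_attention

B, H, S, D = 1, 8, 512, 64  # batch, heads, sequence length, head dim

def causal(b, h, q_idx, kv_idx):
    # mask_mod: True where the query token may attend to the key token.
    # A paged-attention variant would first translate kv_idx through a
    # page table before applying the same logical mask.
    return q_idx >= kv_idx

block_mask = create_block_mask(causal, B=B, H=H, Q_LEN=S, KV_LEN=S, device="cuda")

q, k, v = (
    torch.randn(B, H, S, D, device="cuda", dtype=torch.bfloat16) for _ in range(3)
)

# FlexAttention generates a fused kernel from the mask definition, so no
# hand-written Triton kernels or flash-attn dependency are required.
out = flex_attention(q, k, v, block_mask=block_mask)
print(out.shape)  # torch.Size([1, 8, 512, 64])
```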

Quick Start & Requirements

  • Install/Run: Use the uv package manager: uv sync to set up the environment, then run the tests and benchmarks with uv run benchmark.py (this engine) and uv run benchmark_vllm.py (vLLM baseline).
  • Prerequisites: PyTorch 2.7.1+cu128, an RTX 3090 (or an equivalent GPU with sufficient memory), and the uv package manager.
  • Resources: Benchmarking was performed on a single RTX 3090 (24 GB) with a workload of 512 requests, 512 input tokens each, and variable output lengths of 128-512 tokens (sketched below).
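As a rough illustration of that workload (shapes only; the actual benchmark.py may construct its requests differently), the request mix could be generated like this:

```python
# Sketch of the benchmark workload described above; not the project's script.
import random

NUM_REQUESTS = 512          # 512 concurrent requests
INPUT_TOKENS = 512          # fixed prompt length per request
OUTPUT_RANGE = (128, 512)   # output length sampled per request

random.seed(0)
requests = [
    {
        "prompt_tokens": INPUT_TOKENS,
        "max_new_tokens": random.randint(*OUTPUT_RANGE),
    }
    for _ in range(NUM_REQUESTS)
]

total_output = sum(r["max_new_tokens"] for r in requests)
print(f"{NUM_REQUESTS} requests, ~{total_output} output tokens total")
```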

Highlighted Details

  • Benchmarks show competitive performance against vLLM, with throughput varying with GPU memory utilization and batch-size configuration (see the vLLM baseline sketch after this list).
  • The project explicitly avoids flash-attn and custom Triton kernels, relying solely on FlexAttention.
  • The Gemma 2 model implementation is adapted directly from Hugging Face transformers.
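For context on those configuration axes, here is a sketch of how a vLLM baseline exposes them through its standard Python API. This is not the project's benchmark_vllm.py; the model id and values are placeholders chosen only to show the relevant knobs.

```python
# vLLM baseline sketch (not the project's benchmark_vllm.py).
from vllm import LLM, SamplingParams

llm = LLM(
    model="google/gemma-2-2b-it",   # assumed Gemma 2 checkpoint
    gpu_memory_utilization=0.9,     # fraction of GPU memory for weights + KV cache
    max_num_seqs=256,               # caps how many sequences are batched per step
)

params = SamplingParams(temperature=0.0, max_tokens=256)
outputs = llm.generate(["Hello, Gemma!"] * 8, params)
for out in outputs:
    print(out.outputs[0].text)
```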

Maintenance & Community

  • Inspired by GeeeekExplorer/nano-vllm.
  • Paged attention implementation is based on pytorch-labs/attention-gym.
  • Gemma 2 model code is adapted from huggingface/transformers.
  • Credits vllm-project/vllm for insights into FlexAttention backend flags.

Licensing & Compatibility

  • Licensed under the MIT License.
  • Third-party code retains its original licenses, detailed in THIRD_PARTY_LICENSES.md.
  • MIT license generally permits commercial use and linking with closed-source projects.

Limitations & Caveats

The provided benchmarks indicate that flex-nano-vllm is generally slower than vLLM across the tested configurations, particularly at higher GPU memory utilization. The project describes itself as "minimal" and notes that a blog post is "coming soon," suggesting it may still be in early development and lack the feature breadth of more mature libraries such as vLLM.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 34 stars in the last 30 days

Explore Similar Projects

Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI).

dots.llm1 by rednote-hilab

0.2%
462
MoE model for research
Created 4 months ago
Updated 4 weeks ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Ying Sheng (Coauthor of SGLang), and 2 more.

LookaheadDecoding by hao-ai-lab

0.2%
1k
Parallel decoding algorithm for faster LLM inference
Created 1 year ago
Updated 6 months ago
Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 8 more.

EAGLE by SafeAILab

10.6%
2k
Speculative decoding research paper for faster LLM inference
Created 1 year ago
Updated 1 week ago
Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 9 more.

LightLLM by ModelTC

0.5%
4k
Python framework for LLM inference and serving
Created 2 years ago
Updated 14 hours ago