flex-nano-vllm by changjonathanc

Fast Gemma 2 inference engine

Created 1 month ago
275 stars

Top 94.1% on SourcePulse

View on GitHub
Project Summary

This project provides a minimal, vLLM-style inference engine for fast Gemma 2 inference, built on FlexAttention with no custom Triton kernels and no FlashAttention dependency. It targets users who want efficient, straightforward LLM inference for Gemma 2 models.
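The repository's exact public API is not shown here; purely as an illustration, an engine inspired by nano-vllm might be driven along the following lines. The module, class, and argument names in this sketch are hypothetical, not taken from flex-nano-vllm.

```python
# Hypothetical usage sketch -- module, class, and argument names are assumed,
# not taken from flex-nano-vllm; the real interface may differ.
from flex_nano_vllm import LLM, SamplingParams  # hypothetical import path

llm = LLM("google/gemma-2-2b-it")                         # load a Gemma 2 checkpoint
params = SamplingParams(temperature=0.7, max_tokens=256)  # hypothetical knobs
outputs = llm.generate(
    ["Summarize paged attention in two sentences."],
    params,
)
print(outputs[0])
```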

How It Works

The engine implements paged attention by adapting the reference implementation from pytorch-labs/attention-gym. The Gemma 2 model code is copied from Hugging Face's transformers library and modified to integrate with FlexAttention and the paged attention mechanism. The codebase favors a flat structure and inline comments for clarity and ease of understanding.
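For orientation, the sketch below shows the FlexAttention primitives this design builds on: a mask_mod defines which query/key pairs may attend, create_block_mask compiles it into a block mask, and flex_attention fuses everything into one generated kernel. This is a minimal causal example, not the project's paged-attention code; a paged variant additionally routes kv_idx through a page table that maps logical token positions to physical KV-cache slots.

```python
# Minimal FlexAttention sketch (illustrative only; not flex-nano-vllm's code).
import torch
from torch.nn.attention.flex_attention import create_block_mask, flex_attention

B, H, S, D = 1, 8, 512, 64  # batch, heads, sequence length, head dim

def causal(b, h, q_idx, kv_idx):
    # mask_mod: True where the query token may attend to the key token.
    # A paged-attention variant would first translate kv_idx through a
    # page table before applying the same logical mask.
    return q_idx >= kv_idx

block_mask = create_block_mask(causal, B=B, H=H, Q_LEN=S, KV_LEN=S, device="cuda")

q, k, v = (
    torch.randn(B, H, S, D, device="cuda", dtype=torch.bfloat16) for _ in range(3)
)

# FlexAttention generates a fused kernel from the mask definition, so no
# hand-written Triton kernels or flash-attn dependency are required.
out = flex_attention(q, k, v, block_mask=block_mask)
print(out.shape)  # torch.Size([1, 8, 512, 64])
```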

Quick Start & Requirements

  • Install/Run: Use the uv package manager: uv sync to set up the environment, then run the tests and benchmarks with uv run benchmark.py (this engine) and uv run benchmark_vllm.py (vLLM baseline).
  • Prerequisites: PyTorch 2.7.1+cu128, an RTX 3090 (or an equivalent GPU with sufficient memory), and the uv package manager.
  • Resources: Benchmarking was performed on a single RTX 3090 (24 GB) with a workload of 512 requests, 512 input tokens each, and variable output lengths of 128-512 tokens (sketched below).
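As a rough illustration of that workload (shapes only; the actual benchmark.py may construct its requests differently), the request mix could be generated like this:

```python
# Sketch of the benchmark workload described above; not the project's script.
import random

NUM_REQUESTS = 512          # 512 concurrent requests
INPUT_TOKENS = 512          # fixed prompt length per request
OUTPUT_RANGE = (128, 512)   # output length sampled per request

random.seed(0)
requests = [
    {
        "prompt_tokens": INPUT_TOKENS,
        "max_new_tokens": random.randint(*OUTPUT_RANGE),
    }
    for _ in range(NUM_REQUESTS)
]

total_output = sum(r["max_new_tokens"] for r in requests)
print(f"{NUM_REQUESTS} requests, ~{total_output} output tokens total")
```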

Highlighted Details

  • Benchmarks show competitive performance against vLLM, with throughput varying with GPU memory utilization and batch-size configuration (see the vLLM baseline sketch after this list).
  • The project explicitly avoids flash-attn and custom Triton kernels, relying solely on FlexAttention.
  • The Gemma 2 model implementation is adapted directly from Hugging Face transformers.
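For context on those configuration axes, here is a sketch of how a vLLM baseline exposes them through its standard Python API. This is not the project's benchmark_vllm.py; the model id and values are placeholders chosen only to show the relevant knobs.

```python
# vLLM baseline sketch (not the project's benchmark_vllm.py).
from vllm import LLM, SamplingParams

llm = LLM(
    model="google/gemma-2-2b-it",   # assumed Gemma 2 checkpoint
    gpu_memory_utilization=0.9,     # fraction of GPU memory for weights + KV cache
    max_num_seqs=256,               # caps how many sequences are batched per step
)

params = SamplingParams(temperature=0.0, max_tokens=256)
outputs = llm.generate(["Hello, Gemma!"] * 8, params)
for out in outputs:
    print(out.outputs[0].text)
```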

Maintenance & Community

  • Inspired by GeeeekExplorer/nano-vllm.
  • Paged attention implementation is based on pytorch-labs/attention-gym.
  • Gemma 2 model code is adapted from huggingface/transformers.
  • Credits vllm-project/vllm for insights into FlexAttention backend flags.

Licensing & Compatibility

  • Licensed under the MIT License.
  • Third-party code retains its original licenses, detailed in THIRD_PARTY_LICENSES.md.
  • MIT license generally permits commercial use and linking with closed-source projects.

Limitations & Caveats

The provided benchmarks indicate that flex-nano-vllm is generally slower than vLLM across the tested configurations, particularly at higher GPU memory utilization. The project describes itself as "minimal" and notes that a blog post is "coming soon," suggesting it may still be in early development and lack the feature breadth of more mature libraries such as vLLM.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 34 stars in the last 30 days

Explore Similar Projects

Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI).

dots.llm1 by rednote-hilab

0.2%
462
MoE model for research
Created 4 months ago
Updated 4 weeks ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Ying Sheng (Coauthor of SGLang), and 2 more.

LookaheadDecoding by hao-ai-lab

0.2%
1k
Parallel decoding algorithm for faster LLM inference
Created 1 year ago
Updated 6 months ago
Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 8 more.

EAGLE by SafeAILab

10.6%
2k
Speculative decoding research paper for faster LLM inference
Created 1 year ago
Updated 1 week ago
Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 9 more.

LightLLM by ModelTC

0.5%
4k
Python framework for LLM inference and serving
Created 2 years ago
Updated 14 hours ago