changjonathanc/flex-nano-vllm: Fast Gemma 2 inference engine
Top 83.5% on SourcePulse
This project provides a minimal, vLLM-style inference engine optimized for fast Gemma 2 inference, built on FlexAttention with no custom Triton kernels or FlashAttention dependencies. It is aimed at users who want efficient, straightforward LLM inference for Gemma 2 models.
How It Works
The engine implements paged attention using an adapted version of the implementation from pytorch-labs/attention-gym. The Gemma 2 model code is copied from Hugging Face's transformers library and modified to integrate with FlexAttention and paged attention. The codebase is intentionally flat and commented for clarity and ease of understanding.
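To make the paged-attention idea concrete, here is a minimal, hypothetical sketch, not the project's actual code, of how a page table can be folded into a FlexAttention mask, assuming PyTorch 2.5+ and illustrative names such as page_table and phys_to_logical:

```python
# Illustrative sketch only (not flex-nano-vllm's code). Assumes PyTorch >= 2.5
# with FlexAttention; page_table and phys_to_logical are made-up names.
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

device = "cuda" if torch.cuda.is_available() else "cpu"

B, H, D = 1, 8, 64
PAGE_SIZE, NUM_PAGES = 128, 4        # physical KV cache holds 4 pages of 128 slots
SEQ_LEN = 256                        # logical tokens currently in the sequence

# Page table: logical block i of this sequence lives in physical page page_table[i].
page_table = [2, 0]                  # two allocated logical blocks

# Inverse map from each physical KV slot back to its logical token index (-1 = unused),
# so an ordinary causal mask over logical positions applies to the scattered cache.
phys_to_logical = torch.full((NUM_PAGES * PAGE_SIZE,), -1, dtype=torch.long, device=device)
for logical_block, phys_page in enumerate(page_table):
    start = phys_page * PAGE_SIZE
    phys_to_logical[start:start + PAGE_SIZE] = torch.arange(
        logical_block * PAGE_SIZE, (logical_block + 1) * PAGE_SIZE, device=device)

def paged_causal_mask(b, h, q_idx, kv_idx):
    # Attend only to allocated slots whose logical position is at or before the query's.
    logical_kv = phys_to_logical[kv_idx]
    return (logical_kv >= 0) & (logical_kv <= q_idx)

block_mask = create_block_mask(
    paged_causal_mask, None, None, SEQ_LEN, NUM_PAGES * PAGE_SIZE, device=device)

q = torch.randn(B, H, SEQ_LEN, D, device=device)
k = torch.randn(B, H, NUM_PAGES * PAGE_SIZE, D, device=device)  # paged physical KV cache
v = torch.randn(B, H, NUM_PAGES * PAGE_SIZE, D, device=device)
out = flex_attention(q, k, v, block_mask=block_mask)            # (B, H, SEQ_LEN, D)
```

The point of the sketch is that paging reduces to an index remapping inside the mask function, which is why no custom Triton or FlashAttention kernel is needed.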
Quick Start & Requirements
- Requires the uv package manager for dependency synchronization and running benchmarks.
- Example commands: uv sync to install dependencies, then uv run benchmark.py and uv run benchmark_vllm.py to run the benchmarks.
Highlighted Details
- No flash-attn or custom Triton kernels; attention relies solely on FlexAttention.
- Gemma 2 model code is adapted from Hugging Face transformers.
Maintenance & Community
- Acknowledged related projects: GeeeekExplorer/nano-vllm, pytorch-labs/attention-gym, huggingface/transformers, and vllm-project/vllm (for insights into FlexAttention backend flags).
- Last update roughly 2 months ago; the repository is currently listed as inactive.
Licensing & Compatibility
- Third-party license notices are provided in THIRD_PARTY_LICENSES.md.
Limitations & Caveats
The provided benchmarks indicate that flex-nano-vllm is generally slower than vLLM across the tested configurations, particularly at higher GPU memory utilization. The project is described as "minimal" and a blog post is noted as "coming soon," suggesting it may be in early development and lack the comprehensive feature set of more mature libraries such as vLLM.