LLM inference system for research
Top 96.8% on SourcePulse
SwiftLLM is a lightweight LLM inference system designed for research purposes, offering performance comparable to vLLM with a significantly smaller codebase. It targets researchers who need to understand, modify, and extend LLM inference systems without the complexity of production-focused frameworks.
How It Works
SwiftLLM employs a master-worker architecture, separating concerns into a control plane for scheduling and a data plane for computation. Its kernels are written in Python using OpenAI Triton rather than hand-written CUDA. Key techniques include iterative scheduling, selective batching, PagedAttention, and FlashAttention, which together deliver high performance from a minimal code footprint.
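To give a flavor of that kernel style, here is a minimal, self-contained Triton kernel in the same Python-plus-Triton idiom. It is a generic element-wise add for illustration only, not code taken from SwiftLLM.

```python
# A generic Triton element-wise kernel, shown only to illustrate the
# Python-plus-Triton style; this is not code from SwiftLLM.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)                  # one program per tile
    offsets = pid * BLOCK + tl.arange(0, BLOCK)  # element indices for this tile
    mask = offsets < n_elements                  # guard the ragged final tile
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)               # enough tiles to cover n
    add_kernel[grid](x, y, out, n, BLOCK=1024)
    return out

if __name__ == "__main__":
    a = torch.randn(4096, device="cuda")
    b = torch.randn(4096, device="cuda")
    assert torch.allclose(add(a, b), a + b)
```

Each kernel instance handles one BLOCK-sized tile, with a mask guarding the final partial tile; attention and paging kernels are built from these same primitives at larger scale.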
Quick Start & Requirements
Install dependencies with `pip install -r requirements.txt`, then install SwiftLLM with `pip install -e .` and `pip install -e csrc`. An NVIDIA GPU is required for the Triton kernels. Model weights are supported in both `.bin` and `.safetensors` formats. Examples cover offline inference (`examples/offline.py`), online serving (`examples/online.py`), and a vLLM-like API server (`swiftllm/server/api_server.py`).
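As one hedged illustration of the two weight formats mentioned above, a loader might branch like this; the function and its name are hypothetical, not SwiftLLM's actual loading code.

```python
# Hypothetical weight loader branching on the two checkpoint formats the
# docs mention; names and structure are illustrative, not SwiftLLM code.
from pathlib import Path
import torch
from safetensors.torch import load_file

def load_weights(path: str) -> dict[str, torch.Tensor]:
    p = Path(path)
    if p.suffix == ".safetensors":
        return load_file(p)                    # memory-mapped safetensors load
    return torch.load(p, map_location="cpu")   # pickle-based .bin state dict
```

The safetensors format avoids pickle's arbitrary-code-execution risk, which is why most newer checkpoints ship in it.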
Highlighted Details
Maintenance & Community
The project is under active development; some features may be incomplete and documentation is limited. Planned work includes support for tensor and pipeline parallelism.
Licensing & Compatibility
The repository does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
SwiftLLM is explicitly not a production-ready solution. Quantization, LoRA, multimodal models, and non-greedy sampling are unsupported and would require custom implementation. Only NVIDIA GPUs are supported, though porting to other hardware may be feasible wherever OpenAI Triton runs.
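To make the sampling limitation concrete, the sketch below contrasts the greedy path with the kind of temperature sampling that would need custom work; none of this is SwiftLLM API.

```python
# Contrast between greedy decoding (the only mode SwiftLLM supports) and
# sampled decoding (unsupported); `logits` is a 1-D next-token logit vector.
import torch

def next_token(logits: torch.Tensor, temperature: float = 0.0) -> int:
    if temperature == 0.0:
        return int(torch.argmax(logits))        # greedy: the supported path
    probs = torch.softmax(logits / temperature, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))  # needs custom work
```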