swiftLLM by interestingLSY

LLM inference system for research

Created 1 year ago
264 stars

Top 96.8% on SourcePulse

Project Summary

SwiftLLM is a lightweight LLM inference system designed for research purposes, offering performance comparable to vLLM with a significantly smaller codebase. It targets researchers who need to understand, modify, and extend LLM inference systems without the complexity of production-focused frameworks.

How It Works

SwiftLLM employs a master-worker architecture that separates concerns into a control plane for scheduling and a data plane for computation. It is written in Python, with GPU kernels implemented in OpenAI Triton. Key techniques include iterative scheduling, selective batching, PagedAttention, and FlashAttention, enabling high performance with a minimal code footprint.
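The iterative (continuous) scheduling idea named above can be sketched as a toy loop: the scheduler re-forms the batch at every decode step, admitting waiting requests and retiring finished ones immediately. All names here are illustrative assumptions, not SwiftLLM's actual API.

```python
from collections import deque

class Request:
    """A toy request: just an id and a number of decode steps left."""
    def __init__(self, rid, tokens_needed):
        self.rid = rid
        self.remaining = tokens_needed
        self.output = []

def iterative_schedule(requests, max_batch=4):
    waiting = deque(requests)
    running, finished = [], []
    step = 0
    while waiting or running:
        # Control plane: top up the batch before every iteration.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # Data plane: one decode step for the whole batch.
        for req in running:
            req.output.append(f"tok{step}")
            req.remaining -= 1
        # Retire finished requests right away, freeing batch slots
        # instead of waiting for the whole batch to drain.
        finished += [r for r in running if r.remaining == 0]
        running = [r for r in running if r.remaining > 0]
        step += 1
    return finished

done = iterative_schedule([Request(i, n) for i, n in enumerate([2, 5, 3, 1, 4])])
```

Note how the shortest request leaves the batch after its first step, so its slot is reused immediately; this is the property that lowers latency relative to static batching.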

Quick Start & Requirements

  • Installation: Clone the repository, install dependencies via pip install -r requirements.txt, and install SwiftLLM with pip install -e . and pip install -e csrc.
  • Prerequisites: Python >= 3.9, PyTorch (correct version for hardware), packaging. NVIDIA GPU is required for Triton kernels.
  • Model Weights: Users must download model weights separately (supports .bin and .safetensors formats).
  • Examples: Offline serving (examples/offline.py), online serving (examples/online.py), and a vLLM-like API server (swiftllm/server/api_server.py).
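The installation steps above can be collected into one shell sequence; the repository URL is inferred from the project and author names shown on this page, so verify it before running.

```shell
# Install SwiftLLM from source (repo URL assumed from the project/author names above).
git clone https://github.com/interestingLSY/swiftLLM.git
cd swiftLLM
# Install a PyTorch build matching your hardware first (Python >= 3.9 required).
pip install -r requirements.txt
pip install -e .        # the swiftllm Python package
pip install -e csrc     # the C/CUDA sources
```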

Highlighted Details

  • Matches or exceeds vLLM performance in both single-forward-pass and online-serving benchmarks.
  • Significantly outperforms vLLM on RTX 4090 due to lower control plane overhead.
  • Codebase is ~2% the size of vLLM (~2k lines of code), facilitating research and modification.
  • Supports LLaMA/LLaMA2/LLaMA3 models and variants.

Maintenance & Community

The project is under active development, with some features potentially incomplete and documentation limited. Future plans include support for tensor and pipeline parallelism.

Licensing & Compatibility

The repository does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

SwiftLLM is explicitly not a production-ready solution. Features such as quantization, LoRA, multimodal models, and non-greedy sampling are not supported and would require custom implementation. Support is limited to NVIDIA GPUs, though porting to other hardware is possible wherever OpenAI Triton supports it.

Health Check

  • Last commit: 3 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 25 stars in the last 30 days


Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Alexander Borzunov (Research Scientist at OpenAI), and 17 more.
