LLM inference system for research
Top 96.8% on SourcePulse
SwiftLLM is a lightweight LLM inference system designed for research purposes, offering performance comparable to vLLM with a significantly smaller codebase. It targets researchers who need to understand, modify, and extend LLM inference systems without the complexity of production-focused frameworks.
How It Works
SwiftLLM employs a master-worker architecture, separating concerns into a control plane for scheduling and a data plane for computation. Its kernels are written in Python using OpenAI Triton rather than hand-written CUDA. Key techniques include iterative scheduling, selective batching, PagedAttention, and FlashAttention, which together deliver high performance from a minimal code footprint.
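To give a flavor of that kernel style, here is a minimal, self-contained Triton kernel in the same Python-plus-Triton idiom. It is a generic element-wise add for illustration only, not code taken from SwiftLLM.

```python
# A generic Triton element-wise kernel, shown only to illustrate the
# Python-plus-Triton style; this is not code from SwiftLLM.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)                  # one program per tile
    offsets = pid * BLOCK + tl.arange(0, BLOCK)  # element indices for this tile
    mask = offsets < n_elements                  # guard the ragged final tile
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)               # enough tiles to cover n
    add_kernel[grid](x, y, out, n, BLOCK=1024)
    return out

if __name__ == "__main__":
    a = torch.randn(4096, device="cuda")
    b = torch.randn(4096, device="cuda")
    assert torch.allclose(add(a, b), a + b)
```

Each kernel instance handles one BLOCK-sized tile, with a mask guarding the final partial tile; attention and paging kernels are built from these same primitives at larger scale.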
Quick Start & Requirements
Install dependencies with `pip install -r requirements.txt`, then install SwiftLLM with `pip install -e .` and `pip install -e csrc`. An NVIDIA GPU is required for the Triton kernels. Model weights are supported in both `.bin` and `.safetensors` formats. Examples cover offline inference (`examples/offline.py`), online serving (`examples/online.py`), and a vLLM-like API server (`swiftllm/server/api_server.py`).
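As one hedged illustration of the two weight formats mentioned above, a loader might branch like this; the function and its name are hypothetical, not SwiftLLM's actual loading code.

```python
# Hypothetical weight loader branching on the two checkpoint formats the
# docs mention; names and structure are illustrative, not SwiftLLM code.
from pathlib import Path
import torch
from safetensors.torch import load_file

def load_weights(path: str) -> dict[str, torch.Tensor]:
    p = Path(path)
    if p.suffix == ".safetensors":
        return load_file(p)                    # memory-mapped safetensors load
    return torch.load(p, map_location="cpu")   # pickle-based .bin state dict
```

The safetensors format avoids pickle's arbitrary-code-execution risk, which is why most newer checkpoints ship in it.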
Highlighted Details
Maintenance & Community
The project is under active development; some features may be incomplete and documentation is limited. Planned work includes support for tensor and pipeline parallelism.
Licensing & Compatibility
The repository does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
SwiftLLM is explicitly not a production-ready solution. Quantization, LoRA, multimodal models, and non-greedy sampling are unsupported and would require custom implementation. Only NVIDIA GPUs are supported, though porting to other hardware may be feasible wherever OpenAI Triton runs.
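To make the sampling limitation concrete, the sketch below contrasts the greedy path with the kind of temperature sampling that would need custom work; none of this is SwiftLLM API.

```python
# Contrast between greedy decoding (the only mode SwiftLLM supports) and
# sampled decoding (unsupported); `logits` is a 1-D next-token logit vector.
import torch

def next_token(logits: torch.Tensor, temperature: float = 0.0) -> int:
    if temperature == 0.0:
        return int(torch.argmax(logits))        # greedy: the supported path
    probs = torch.softmax(logits / temperature, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))  # needs custom work
```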