flashtensors by leoheuler

Fast inference engine for large language models

Created 7 months ago
447 stars

Top 66.5% on SourcePulse

Project Summary

flashtensors is a high-performance inference engine and model loader designed to drastically reduce LLM loading times, so that many large models can be served from a single GPU with minimal latency impact. It targets engineers and power users who need efficient, scalable AI deployments, with applications ranging from personalized AI to robotics and serverless inference. The primary benefit is rapid model hot-swapping and significantly faster cold starts, which make large models more accessible and cost-effective.

How It Works

The project redesigns model loading from the ground up, optimizing the transfer of model weights from SSD to GPU VRAM. It uses techniques such as memory pooling and chunking to load models up to 10x faster than traditional loaders. By removing I/O and CPU bottlenecks, this approach enables model hot-swapping and cold starts in under 2 seconds, even for multi-billion-parameter models. It integrates with popular inference backends such as vLLM.
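
To make "memory pooling and chunking" concrete, here is a minimal sketch of the general idea using only the Python standard library. It is not flashtensors' implementation: the names `make_pool`, `load_in_chunks`, and `consume`, as well as the chunk and pool sizes, are hypothetical, and the `consume` callback merely stands in for a host-to-GPU copy.

```python
from queue import Queue

CHUNK_SIZE = 4 * 1024 * 1024  # hypothetical chunk size (4 MiB)
POOL_SIZE = 4                 # hypothetical number of reusable buffers


def make_pool(pool_size=POOL_SIZE, chunk_size=CHUNK_SIZE):
    """Preallocate fixed-size buffers once so the read loop never allocates."""
    pool = Queue()
    for _ in range(pool_size):
        pool.put(bytearray(chunk_size))
    return pool


def load_in_chunks(path, pool, consume):
    """Read the file at `path` chunk by chunk into pooled buffers.

    Each filled chunk is passed to `consume(view)`, a stand-in for the
    copy to GPU memory. The view is only valid during the call, because
    the buffer goes back to the pool afterwards. Returns the byte count.
    """
    total = 0
    # buffering=0 opens a raw file object, so readinto() fills our buffer
    # directly without an extra copy through Python's own I/O buffer.
    with open(path, "rb", buffering=0) as f:
        while True:
            buf = pool.get()      # borrow a buffer from the pool
            try:
                n = f.readinto(buf)
                if not n:         # 0 bytes read means end of file
                    break
                consume(memoryview(buf)[:n])
                total += n
            finally:
                pool.put(buf)     # always return the buffer for reuse
    return total
```

This single-threaded version only shows buffer reuse. A loader built on the same idea would presumably run reads and device copies concurrently, so that one chunk is being copied while the next is being read from disk.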

Quick Start & Requirements

  • Installation: pip install git+https://github.com/leoheuler/flashtensors.git
  • Prerequisites: Requires a GPU with sufficient VRAM, Python, and likely CUDA. Model storage path configuration is necessary.
  • Resource Footprint: GPU memory utilization and memory pool size are configurable. Benchmarks indicate cold starts under 5 seconds for 32B models on H100 GPUs.
  • Links:
    • CLI Quick Start: flash start, flash pull, flash run
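
Cold-start figures like the one above depend heavily on hardware and storage, so it can be worth measuring them locally. The sketch below is a generic timing harness, not part of flashtensors: `time_cold_start` and `load_fn` are hypothetical names, and `load_fn` is whatever zero-argument callable loads your model.

```python
import time


def time_cold_start(load_fn, repeats=3):
    """Call load_fn() `repeats` times and return the elapsed seconds of each run."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        load_fn()
        timings.append(time.perf_counter() - start)
    return timings
```

Only the first run is likely to resemble a true cold start. Later runs may be served from the operating system's page cache, so comparing the first timing against the rest gives a rough sense of how much of the load time is disk I/O.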

Health Check

  • Last Commit: 3 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 3 stars in the last 30 days
