flashtensors by leoheuler

Fast inference engine for large language models

Created 1 month ago
405 stars

Top 71.7% on SourcePulse

Project Summary

flashtensors is a high-performance inference engine and model loader designed to drastically reduce LLM loading times, enabling users to run numerous large models on a single GPU with minimal latency impact. It targets engineers and power users seeking efficient, scalable AI deployments for applications ranging from personalized AI to robotics and serverless inference. The primary benefit is enabling rapid model hot-swapping and significantly faster coldstarts, making large models more accessible and cost-effective.

How It Works

The project redesigns model loading from the ground up, optimizing the transfer of model weights from SSD to GPU VRAM. By using memory pooling and chunked transfers to remove I/O and CPU bottlenecks, it loads weights up to 10x faster than traditional loaders, enabling model hot-swapping and coldstarts in under 2 seconds even for multi-billion-parameter models. It integrates with popular inference backends such as vLLM.
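
As a rough illustration of the general technique described above (chunked reads fed through a small pinned-memory pool with asynchronous host-to-device copies), the sketch below streams a raw weight file from SSD into VRAM. This is not flashtensors' actual implementation; the chunk size, double-buffering scheme, and raw-byte file layout are assumptions made for the example.

```python
import os
import torch

def load_weights_chunked(path: str, device: str = "cuda",
                         chunk_bytes: int = 64 * 1024 * 1024) -> torch.Tensor:
    """Stream a raw weight file from SSD to GPU VRAM through a small pinned-memory pool."""
    num_bytes = os.path.getsize(path)
    gpu_buf = torch.empty(num_bytes, dtype=torch.uint8, device=device)
    # Two reusable pinned staging buffers, so the next disk read can overlap
    # the previous asynchronous host-to-device copy (a tiny "memory pool").
    staging = [torch.empty(chunk_bytes, dtype=torch.uint8, pin_memory=True) for _ in range(2)]
    copied = [torch.cuda.Event(), torch.cuda.Event()]
    for ev in copied:
        ev.record()  # mark both staging buffers as free before first use
    stream = torch.cuda.Stream()

    offset, i = 0, 0
    with open(path, "rb", buffering=0) as f:
        while offset < num_bytes:
            buf, ev = staging[i % 2], copied[i % 2]
            ev.synchronize()               # wait until this staging buffer is free again
            n = f.readinto(buf.numpy())    # read the next chunk straight into pinned memory
            if n == 0:
                break
            with torch.cuda.stream(stream):
                gpu_buf[offset:offset + n].copy_(buf[:n], non_blocking=True)
                ev.record()                # fires once the copy out of this buffer completes
            offset += n
            i += 1
    stream.synchronize()                   # ensure every chunk has landed in VRAM
    return gpu_buf  # raw bytes; a real loader reinterprets them as typed tensors
```

A production loader layers more on top of this (direct I/O, typed tensor views, multiple streams), but the core idea is the same: keep the SSD, the pinned staging pool, and the PCIe bus busy at the same time instead of serializing read-then-copy.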

Quick Start & Requirements

  • Installation: pip install git+https://github.com/leoheuler/flashtensors.git
  • Prerequisites: A GPU with sufficient VRAM, Python, and (most likely) CUDA; a model storage path must be configured.
  • Resource Footprint: Configurable GPU memory utilization and memory pool size. Benchmarks indicate coldstarts under 5 seconds for 32B models on H100 GPUs.
  • Links: https://github.com/leoheuler/flashtensors
  • CLI Quick Start: `flash start`, `flash pull`, `flash run` (illustrated below)
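
The CLI commands above can be chained into a minimal coldstart smoke test. The sketch below scripts the documented commands from Python; the model identifier, and the assumption that `flash pull`/`flash run` take a model argument, are illustrative rather than taken from the project's docs.

```python
# Hypothetical end-to-end run of the documented flashtensors CLI, scripted from Python.
import subprocess

MODEL = "qwen2.5-7b-instruct"  # placeholder model id (assumption)

subprocess.run(["flash", "start"], check=True)        # start the flashtensors service
subprocess.run(["flash", "pull", MODEL], check=True)  # fetch the model into local storage (assumed semantics)
subprocess.run(["flash", "run", MODEL], check=True)   # hot-load the model onto the GPU (assumed semantics)
```
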
Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 4
  • Issues (30d): 5
  • Star History: 334 stars in the last 30 days
Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla and OpenAI; author of CS 231n), Clement Delangue (Cofounder of Hugging Face), and 60 more.

vllm by vllm-project

64k stars
Top 0.8% on SourcePulse
LLM serving engine for high-throughput, memory-efficient inference
Created 2 years ago
Updated 1 day ago