flashtensors by leoheuler

Fast inference engine for large language models

Created 1 month ago
405 stars

Top 71.7% on SourcePulse

Project Summary

flashtensors is a high-performance inference engine and model loader designed to drastically reduce LLM loading times, enabling users to run numerous large models on a single GPU with minimal latency impact. It targets engineers and power users seeking efficient, scalable AI deployments for applications ranging from personalized AI to robotics and serverless inference. The primary benefit is enabling rapid model hot-swapping and significantly faster coldstarts, making large models more accessible and cost-effective.

How It Works

The project redesigns model loading from the ground up, optimizing the transfer of model weights from SSD to GPU VRAM. By using memory pooling and chunked transfers to remove I/O and CPU bottlenecks, it loads weights up to 10x faster than traditional loaders, enabling model hot-swapping and coldstarts in under 2 seconds even for multi-billion-parameter models. It integrates with popular inference backends such as vLLM.
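
As a rough illustration of the general technique described above (chunked reads fed through a small pinned-memory pool with asynchronous host-to-device copies), the sketch below streams a raw weight file from SSD into VRAM. This is not flashtensors' actual implementation; the chunk size, double-buffering scheme, and raw-byte file layout are assumptions made for the example.

```python
import os
import torch

def load_weights_chunked(path: str, device: str = "cuda",
                         chunk_bytes: int = 64 * 1024 * 1024) -> torch.Tensor:
    """Stream a raw weight file from SSD to GPU VRAM through a small pinned-memory pool."""
    num_bytes = os.path.getsize(path)
    gpu_buf = torch.empty(num_bytes, dtype=torch.uint8, device=device)
    # Two reusable pinned staging buffers, so the next disk read can overlap
    # the previous asynchronous host-to-device copy (a tiny "memory pool").
    staging = [torch.empty(chunk_bytes, dtype=torch.uint8, pin_memory=True) for _ in range(2)]
    copied = [torch.cuda.Event(), torch.cuda.Event()]
    for ev in copied:
        ev.record()  # mark both staging buffers as free before first use
    stream = torch.cuda.Stream()

    offset, i = 0, 0
    with open(path, "rb", buffering=0) as f:
        while offset < num_bytes:
            buf, ev = staging[i % 2], copied[i % 2]
            ev.synchronize()               # wait until this staging buffer is free again
            n = f.readinto(buf.numpy())    # read the next chunk straight into pinned memory
            if n == 0:
                break
            with torch.cuda.stream(stream):
                gpu_buf[offset:offset + n].copy_(buf[:n], non_blocking=True)
                ev.record()                # fires once the copy out of this buffer completes
            offset += n
            i += 1
    stream.synchronize()                   # ensure every chunk has landed in VRAM
    return gpu_buf  # raw bytes; a real loader reinterprets them as typed tensors
```

A production loader layers more on top of this (direct I/O, typed tensor views, multiple streams), but the core idea is the same: keep the SSD, the pinned staging pool, and the PCIe bus busy at the same time instead of serializing read-then-copy.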

Quick Start & Requirements

  • Installation: pip install git+https://github.com/leoheuler/flashtensors.git
  • Prerequisites: A GPU with sufficient VRAM, Python, and (most likely) CUDA; a model storage path must be configured.
  • Resource Footprint: Configurable GPU memory utilization and memory pool size. Benchmarks indicate coldstarts under 5 seconds for 32B models on H100 GPUs.
  • Links: https://github.com/leoheuler/flashtensors
  • CLI Quick Start: `flash start`, `flash pull`, `flash run` (illustrated below)
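
The CLI commands above can be chained into a minimal coldstart smoke test. The sketch below scripts the documented commands from Python; the model identifier, and the assumption that `flash pull`/`flash run` take a model argument, are illustrative rather than taken from the project's docs.

```python
# Hypothetical end-to-end run of the documented flashtensors CLI, scripted from Python.
import subprocess

MODEL = "qwen2.5-7b-instruct"  # placeholder model id (assumption)

subprocess.run(["flash", "start"], check=True)        # start the flashtensors service
subprocess.run(["flash", "pull", MODEL], check=True)  # fetch the model into local storage (assumed semantics)
subprocess.run(["flash", "run", MODEL], check=True)   # hot-load the model onto the GPU (assumed semantics)
```
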
Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 4
  • Issues (30d): 5
  • Star History: 334 stars in the last 30 days
Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla and OpenAI; author of CS 231n), Clement Delangue (Cofounder of Hugging Face), and 60 more.

vllm by vllm-project

64k stars
Top 0.8% on SourcePulse
LLM serving engine for high-throughput, memory-efficient inference
Created 2 years ago
Updated 1 day ago