inferflow by inferflow

High-performance LLM inference engine

Created 2 years ago · 250 stars · Top 100.0% on SourcePulse

View on GitHub
Project Summary

An efficient and highly configurable inference engine for large language models (LLMs), Inferflow simplifies serving diverse transformer models without requiring source code modifications. It targets engineers and researchers needing to deploy LLMs, offering benefits like reduced setup complexity, support for large models on consumer hardware, and advanced optimization techniques.

How It Works

Inferflow employs a modular framework of atomic building blocks, enabling users to serve new models by editing configuration files rather than writing code; because these blocks compose freely, support generalizes across related architectures. Key advantages include a novel 3.5-bit quantization scheme alongside other bit depths, and sophisticated hybrid model partitioning for efficient multi-GPU inference, a feature seldom found in other engines. It also features a custom C++ parser for loading models from pickle files safely, mitigating the arbitrary-code-execution risk inherent in Python's pickle format.
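The summary does not explain how the 3.5-bit scheme works, but the arithmetic behind a fractional bit width is easy to illustrate: quantize each weight to one of 11 levels (2^3.5 ≈ 11.3), and a pair of such values fits in 7 bits because 11 × 11 = 121 < 128. The sketch below is a hypothetical illustration of that idea, not Inferflow's actual codec; the per-block `scale` and `zero` parameters are assumptions.

```python
# Hypothetical sketch of a 3.5-bit-per-weight packing scheme (NOT Inferflow's codec).
# Two 11-level values (0..10) pack into one 7-bit code: 11 * 11 = 121 < 128.

def quantize_pair(w1: float, w2: float, scale: float, zero: float) -> int:
    """Snap two floats to an 11-level grid, then pack both into one byte."""
    q1 = max(0, min(10, round((w1 - zero) / scale)))
    q2 = max(0, min(10, round((w2 - zero) / scale)))
    return q1 * 11 + q2  # 0..120, fits in 7 bits

def dequantize_pair(packed: int, scale: float, zero: float) -> tuple[float, float]:
    """Unpack the two grid levels and map them back to floats."""
    q1, q2 = divmod(packed, 11)
    return q1 * scale + zero, q2 * scale + zero

# Example with made-up per-block parameters:
scale, zero = 0.05, -0.25
code = quantize_pair(0.12, -0.20, scale, zero)
print(code, dequantize_pair(code, scale, zero))  # prints 78 and values ≈ (0.10, -0.20)
```

Real schemes add per-block scales and zero points chosen from each block's value range; the point here is only how an average of 3.5 bits per weight can be reached with whole bytes.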

Quick Start & Requirements

Installation involves building from source with CMake. GPU builds require the CUDA Toolkit and commands like `cmake ../.. -DUSE_CUDA=1 -DCMAKE_CUDA_ARCHITECTURES=75 && make install -j 8`; CPU-only builds use `-DUSE_CUDA=0`. Detailed instructions are available for Windows, Linux, macOS, and WSL. Users must download model weights separately.
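Assembled from the commands quoted above, a CUDA build on Linux might look like the sketch below. The clone URL and the two-level build directory (implied by `cmake ../..`) are assumptions; set `-DCMAKE_CUDA_ARCHITECTURES` to match your GPU.

```bash
# Sketch of a from-source CUDA build, assembled from the commands quoted above.
# The clone URL and build-directory layout are assumptions, not documented here.
git clone https://github.com/inferflow/inferflow
cd inferflow
mkdir -p build/gpu && cd build/gpu                      # two levels deep, matching ../..
cmake ../.. -DUSE_CUDA=1 -DCMAKE_CUDA_ARCHITECTURES=75  # 75 = Turing; match your GPU
make install -j 8

# CPU-only variant:
# cmake ../.. -DUSE_CUDA=0 && make install -j 8
```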

Highlighted Details

  • Supports 2-bit to 8-bit quantization, including a novel 3.5-bit scheme.
  • Offers partition-by-layer, partition-by-tensor, and hybrid parallelism for multi-GPU setups.
  • Loads models in pickle, safetensors, and gguf formats, using a custom C++ parser to handle pickle files safely.
  • Accommodates decoder-only, encoder-only, and encoder-decoder model architectures.
  • Provides GPU/CPU hybrid inference capabilities.
  • Includes compatibility with OpenAI's Chat Completions API (a request sketch follows this list).
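Because the engine advertises OpenAI Chat Completions compatibility, a standard client request should work against it. A minimal sketch follows; the host, port, and model name are placeholders, not values documented in this summary.

```python
# Minimal OpenAI-style Chat Completions request to a local Inferflow service.
# The endpoint URL and model name are placeholders; check your service config.
import json
import urllib.request

req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",  # assumed local endpoint
    data=json.dumps({
        "model": "llama2-7b-chat",                # placeholder model name
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
    }).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    reply = json.load(resp)
print(reply["choices"][0]["message"]["content"])
```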

Maintenance & Community

The project released version 0.1.0 in January 2024 and added Mixture-of-Experts (MoE) support in February 2024; however, the health metrics below show no activity in the past year. The README lists no community channels (such as Discord or Slack) and no detailed roadmap.

Licensing & Compatibility

The README does not state Inferflow's open-source license. Anyone considering commercial use or integration into proprietary systems should verify the licensing terms in the repository before adopting it.

Limitations & Caveats

As of version 0.1.0, Inferflow should be treated as early-stage software, and the repository has seen no commits in about a year. Building from source is the primary installation method, which can be a barrier for less experienced users, and the absence of a clearly stated license is a significant caveat for adoption.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 30 days

Starred by Luis Capelo (Cofounder of Lightning AI), Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), and 4 more.

Explore Similar Projects

ktransformers by kvcache-ai
Top 0.2% on SourcePulse · 16k stars
Framework for LLM inference optimization experimentation
Created 1 year ago · Updated 1 day ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla and OpenAI; author of CS 231n), Anil Dash (former CEO of Glitch), and 23 more.

llamafile by mozilla-ai
Top 0.1% on SourcePulse · 24k stars
Single-file LLM distribution and runtime via `llama.cpp` and Cosmopolitan Libc
Created 2 years ago · Updated 2 days ago