inferflow by inferflow

High-performance LLM inference engine

Created 2 years ago · 250 stars · Top 100.0% on SourcePulse

View on GitHub
Project Summary

An efficient and highly configurable inference engine for large language models (LLMs), Inferflow simplifies serving diverse transformer models without requiring source code modifications. It targets engineers and researchers needing to deploy LLMs, offering benefits like reduced setup complexity, support for large models on consumer hardware, and advanced optimization techniques.

How It Works

Inferflow employs a modular framework of atomic building blocks, enabling users to serve new models by editing configuration files rather than writing code; because these blocks compose freely, support generalizes across related architectures. Key advantages include a novel 3.5-bit quantization scheme alongside other bit depths, and sophisticated hybrid model partitioning for efficient multi-GPU inference, a feature seldom found in other engines. It also features a custom C++ parser for loading models from pickle files safely, mitigating the arbitrary-code-execution risk inherent in Python's pickle format.
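The summary does not explain how the 3.5-bit scheme works, but the arithmetic behind a fractional bit width is easy to illustrate: quantize each weight to one of 11 levels (2^3.5 ≈ 11.3), and a pair of such values fits in 7 bits because 11 × 11 = 121 < 128. The sketch below is a hypothetical illustration of that idea, not Inferflow's actual codec; the per-block `scale` and `zero` parameters are assumptions.

```python
# Hypothetical sketch of a 3.5-bit-per-weight packing scheme (NOT Inferflow's codec).
# Two 11-level values (0..10) pack into one 7-bit code: 11 * 11 = 121 < 128.

def quantize_pair(w1: float, w2: float, scale: float, zero: float) -> int:
    """Snap two floats to an 11-level grid, then pack both into one byte."""
    q1 = max(0, min(10, round((w1 - zero) / scale)))
    q2 = max(0, min(10, round((w2 - zero) / scale)))
    return q1 * 11 + q2  # 0..120, fits in 7 bits

def dequantize_pair(packed: int, scale: float, zero: float) -> tuple[float, float]:
    """Unpack the two grid levels and map them back to floats."""
    q1, q2 = divmod(packed, 11)
    return q1 * scale + zero, q2 * scale + zero

# Example with made-up per-block parameters:
scale, zero = 0.05, -0.25
code = quantize_pair(0.12, -0.20, scale, zero)
print(code, dequantize_pair(code, scale, zero))  # prints 78 and values ≈ (0.10, -0.20)
```

Real schemes add per-block scales and zero points chosen from each block's value range; the point here is only how an average of 3.5 bits per weight can be reached with whole bytes.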

Quick Start & Requirements

Installation involves building from source with CMake. GPU builds require the CUDA Toolkit and commands like `cmake ../.. -DUSE_CUDA=1 -DCMAKE_CUDA_ARCHITECTURES=75 && make install -j 8`; CPU-only builds use `-DUSE_CUDA=0`. Detailed instructions are available for Windows, Linux, macOS, and WSL. Users must download model weights separately.
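Assembled from the commands quoted above, a CUDA build on Linux might look like the sketch below. The clone URL and the two-level build directory (implied by `cmake ../..`) are assumptions; set `-DCMAKE_CUDA_ARCHITECTURES` to match your GPU.

```bash
# Sketch of a from-source CUDA build, assembled from the commands quoted above.
# The clone URL and build-directory layout are assumptions, not documented here.
git clone https://github.com/inferflow/inferflow
cd inferflow
mkdir -p build/gpu && cd build/gpu                      # two levels deep, matching ../..
cmake ../.. -DUSE_CUDA=1 -DCMAKE_CUDA_ARCHITECTURES=75  # 75 = Turing; match your GPU
make install -j 8

# CPU-only variant:
# cmake ../.. -DUSE_CUDA=0 && make install -j 8
```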

Highlighted Details

  • Supports 2-bit to 8-bit quantization, including a novel 3.5-bit scheme.
  • Offers partition-by-layer, partition-by-tensor, and hybrid parallelism for multi-GPU setups.
  • Loads models in pickle, safetensors, and gguf formats, using a custom C++ parser to handle pickle files safely.
  • Accommodates decoder-only, encoder-only, and encoder-decoder model architectures.
  • Provides GPU/CPU hybrid inference capabilities.
  • Includes compatibility with OpenAI's Chat Completions API (a request sketch follows this list).
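Because the engine advertises OpenAI Chat Completions compatibility, a standard client request should work against it. A minimal sketch follows; the host, port, and model name are placeholders, not values documented in this summary.

```python
# Minimal OpenAI-style Chat Completions request to a local Inferflow service.
# The endpoint URL and model name are placeholders; check your service config.
import json
import urllib.request

req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",  # assumed local endpoint
    data=json.dumps({
        "model": "llama2-7b-chat",                # placeholder model name
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
    }).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    reply = json.load(resp)
print(reply["choices"][0]["message"]["content"])
```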

Maintenance & Community

The project released version 0.1.0 in January 2024 and added Mixture-of-Experts (MoE) support in February 2024; however, the health metrics below show no activity in the past year. The README lists no community channels (such as Discord or Slack) and no detailed roadmap.

Licensing & Compatibility

The README does not state Inferflow's open-source license. Anyone considering commercial use or integration into proprietary systems should verify the licensing terms in the repository before adopting it.

Limitations & Caveats

As of version 0.1.0, Inferflow should be treated as early-stage software, and the repository has seen no commits in about a year. Building from source is the primary installation method, which can be a barrier for less experienced users, and the absence of a clearly stated license is a significant caveat for adoption.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 30 days

Starred by Luis Capelo (Cofounder of Lightning AI), Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), and 4 more.

Explore Similar Projects

ktransformers by kvcache-ai
Top 0.2% on SourcePulse · 16k stars
Framework for LLM inference optimization experimentation
Created 1 year ago · Updated 1 day ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla and OpenAI; author of CS 231n), Anil Dash (former CEO of Glitch), and 23 more.

llamafile by mozilla-ai
Top 0.1% on SourcePulse · 24k stars
Single-file LLM distribution and runtime via `llama.cpp` and Cosmopolitan Libc
Created 2 years ago · Updated 2 days ago