ScaleLLM by vectorch-ai

LLM inference system for production environments

Created 2 years ago
496 stars

Top 62.4% on SourcePulse

Project Summary

ScaleLLM is a high-performance inference system for large language models (LLMs) designed for production environments. It offers an OpenAI-compatible API and supports a wide range of popular open-source models, aiming to provide efficient and customizable LLM deployment.

How It Works

ScaleLLM leverages state-of-the-art techniques including Flash Attention, Paged Attention, and continuous batching for high-efficiency inference. It also supports tensor parallelism for distributed model execution. Advanced features like CUDA graphs, prefix caching, chunked prefill, and speculative decoding are integrated to further optimize performance and reduce latency.
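As a toy illustration of the continuous-batching idea, the sketch below (a hypothetical simulation, not ScaleLLM's actual scheduler) admits queued requests into the running batch as soon as a finished sequence frees a slot, instead of waiting for the whole batch to drain:

```python
from collections import deque

def continuous_batching(requests, max_batch=4):
    """Toy continuous-batching loop. Each request is (request_id,
    tokens_to_generate); every step, each running request emits one
    token, and freed slots are refilled immediately from the queue."""
    waiting = deque(requests)
    running = {}                       # request_id -> tokens remaining
    steps = 0
    while waiting or running:
        # Admit new requests into any free slots before each decode step.
        while waiting and len(running) < max_batch:
            rid, n = waiting.popleft()
            running[rid] = n
        # One "decode step": every running request generates one token.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]       # slot freed for the next request
        steps += 1
    return steps

# Short requests finish early and free slots for queued ones, so the
# total step count is far below sequential execution.
print(continuous_batching([("a", 5), ("b", 2), ("c", 2), ("d", 5), ("e", 3)]))
```

With static batching, the fifth request would have to wait for the entire first batch to finish; here it starts as soon as the first short sequence completes.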

Quick Start & Requirements

  • Installation: pip install -U scalellm
  • CUDA/PyTorch Versions: Supports custom installations via index URLs (e.g., pip install -U scalellm -i https://whl.vectorch.com/cu121/torch2.4.1/).
  • Prerequisites: Requires NVIDIA GPUs newer than the Turing architecture.
  • Demo: An OpenAI-compatible REST API server can be started with python3 -m scalellm.serve.api_server --model=<model_name>. A chatbot UI is available via Docker.
  • Docs: https://docs.vectorch.com/
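Once the server is running, it speaks the OpenAI wire format. The sketch below builds a request body for the standard /v1/chat/completions endpoint; the host, port, and model name are illustrative assumptions, not values taken from ScaleLLM's docs:

```python
import json

def build_chat_request(model, user_message, host="http://localhost:8080"):
    """Hypothetical helper: assembles the URL and JSON body for an
    OpenAI-compatible chat completion request (fields follow the
    OpenAI API shape)."""
    url = f"{host}/v1/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "stream": False,
    }
    return url, json.dumps(payload)

url, body = build_chat_request("meta-llama/Llama-2-7b-chat-hf", "Hello!")
print(url)
```

Any OpenAI-compatible client (including the official `openai` Python package pointed at this base URL) can be used instead of hand-building requests.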

Highlighted Details

  • Supports a broad range of LLMs including Llama, Gemma, Mistral, Qwen, and more.
  • Offers both chat and completion APIs.
  • Integrates GPTQ and AWQ quantization methods.
  • Features CUDA graph support for reduced kernel launch overhead.
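For intuition about what GPTQ/AWQ-style weight quantization buys, here is a conceptual sketch of symmetric 4-bit quantization in pure Python: weights collapse to 16 integer levels plus a float scale. This is an illustration of the general technique only, not ScaleLLM's kernels or either algorithm's calibration procedure:

```python
def quantize_int4(weights):
    """Symmetric 4-bit quantization: map floats to integer levels in
    [-8, 7] with a single shared scale (per-group in real systems)."""
    scale = max(abs(w) for w in weights) / 7 or 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from levels and scale."""
    return [v * scale for v in q]

w = [0.12, -0.53, 0.98, -0.07]
q, s = quantize_int4(w)
w_hat = dequantize(q, s)
# Reconstruction error is bounded by half a quantization step.
err = max(abs(a - b) for a, b in zip(w, w_hat))
print(q, round(err, 4))
```

Real GPTQ and AWQ additionally use calibration data to choose scales (and, in GPTQ's case, to compensate rounding error weight-by-weight), which is what keeps accuracy close to the fp16 baseline.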

Maintenance & Community

  • Active development with a public roadmap available on GitHub Issues.
  • Community support via Discord: https://discord.gg/PKe5gvBZfn and GitHub Discussions.

Licensing & Compatibility

  • Licensed under the Apache 2.0 license.
  • Compatible with commercial use and closed-source linking.

Limitations & Caveats

  • Currently only supports GPUs newer than the Turing architecture.
Health Check

  • Last Commit: 3 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 5 stars in the last 30 days

Explore Similar Projects

Starred by Ying Sheng (Coauthor of SGLang) and Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake).

llm-analysis by cli99

  • 485 stars
  • CLI tool for LLM latency/memory analysis during training/inference
  • Created 2 years ago; updated 11 months ago
  • Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Clement Delangue (Cofounder of Hugging Face), and 61 more.

vllm by vllm-project

  • 76k stars
  • LLM serving engine for high-throughput, memory-efficient inference
  • Created 3 years ago; updated 20 hours ago