ScaleLLM by vectorch-ai

LLM inference system for production environments

Created 2 years ago
466 stars

Top 65.2% on SourcePulse

Project Summary

ScaleLLM is a high-performance inference system for large language models (LLMs) designed for production environments. It offers an OpenAI-compatible API and supports a wide range of popular open-source models, aiming to provide efficient and customizable LLM deployment.

How It Works

ScaleLLM leverages state-of-the-art techniques including Flash Attention, Paged Attention, and continuous batching for high-efficiency inference. It also supports tensor parallelism for distributed model execution. Advanced features like CUDA graphs, prefix caching, chunked prefill, and speculative decoding are integrated to further optimize performance and reduce latency.
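
To make the memory model concrete, here is a small conceptual sketch of the block-table bookkeeping behind Paged Attention and continuous batching (illustrative only, not ScaleLLM's actual code; block and pool sizes are made up): the KV cache is split into fixed-size blocks allocated on demand, and recycling a finished sequence's blocks is what lets a waiting request join the running batch.

    # Conceptual sketch of paged KV-cache bookkeeping; sizes are illustrative.
    BLOCK_SIZE = 16                            # tokens per KV-cache block
    free_blocks = list(range(1024))            # pool of physical block ids
    block_tables: dict[int, list[int]] = {}    # seq_id -> physical blocks
    seq_lens: dict[int, int] = {}              # seq_id -> tokens cached so far

    def append_token(seq_id: int) -> tuple[int, int]:
        """Reserve a cache slot for one new token; returns (block, offset)."""
        n = seq_lens.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:                # current block full (or first token)
            block_tables.setdefault(seq_id, []).append(free_blocks.pop())
        seq_lens[seq_id] = n + 1
        return block_tables[seq_id][n // BLOCK_SIZE], n % BLOCK_SIZE

    def free_sequence(seq_id: int) -> None:
        """Return a finished sequence's blocks to the pool, so continuous
        batching can admit a waiting request without restarting the batch."""
        free_blocks.extend(block_tables.pop(seq_id, []))
        seq_lens.pop(seq_id, None)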

Quick Start & Requirements

  • Installation: pip install -U scalellm
  • CUDA/PyTorch Versions: Wheels built for specific CUDA/PyTorch combinations can be installed via custom index URLs (e.g., pip install -U scalellm -i https://whl.vectorch.com/cu121/torch2.4.1/).
  • Prerequisites: Requires NVIDIA GPUs newer than the Turing architecture.
  • Demo: An OpenAI-compatible REST API server can be started with python3 -m scalellm.serve.api_server --model=<model_name>; see the client sketch after this list. A chatbot UI is available via Docker.
  • Docs: https://docs.vectorch.com/
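
Because the server speaks the OpenAI protocol, any OpenAI client can talk to it. A minimal sketch using the official openai Python package (the port, API key, and model name below are assumptions; match them to the flags you passed to the server):

    from openai import OpenAI

    # Base URL and model name are assumptions; point them at your own server.
    client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")
    resp = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        messages=[{"role": "user", "content": "Say hello in one sentence."}],
    )
    print(resp.choices[0].message.content)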

Highlighted Details

  • Supports a broad range of LLMs including Llama, Gemma, Mistral, Qwen, and more.
  • Offers both chat and completion APIs.
  • Integrates GPTQ and AWQ quantization methods (see the sketch after this list).
  • Features CUDA graph support for reduced kernel launch overhead.
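
For intuition, this toy sketch shows the group-wise low-bit idea that GPTQ- and AWQ-style weight quantization build on: weights are split into groups, each stored as int4 codes plus one scale, and dequantized on the fly. Real kernels pack codes and use calibrated scales; this is not ScaleLLM's implementation.

    import numpy as np

    def quantize_group(w: np.ndarray) -> tuple[np.ndarray, float]:
        """Symmetric 4-bit quantization of one group: int4 codes + one scale."""
        scale = float(np.abs(w).max()) / 7.0          # int4 range is [-8, 7]
        q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
        return q, scale

    def dequantize_group(q: np.ndarray, scale: float) -> np.ndarray:
        return q.astype(np.float32) * scale

    w = np.random.randn(128).astype(np.float32)       # one group of 128 weights
    q, s = quantize_group(w)
    print("max abs error:", np.abs(w - dequantize_group(q, s)).max())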

Maintenance & Community

  • Active development with a public roadmap available on GitHub Issues.
  • Community support via Discord: https://discord.gg/PKe5gvBZfn and GitHub Discussions.

Licensing & Compatibility

  • Licensed under the Apache 2.0 license.
  • Compatible with commercial use and closed-source linking.

Limitations & Caveats

  • Currently supports only NVIDIA GPUs newer than the Turing architecture; a quick capability check is sketched below.
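
Turing GPUs report compute capability 7.5, so "newer than Turing" means 8.0 or higher (Ampere onward). A quick check, assuming PyTorch with CUDA is installed:

    import torch

    # Turing is compute capability 7.5; Ampere and newer report 8.0+.
    major, minor = torch.cuda.get_device_capability(0)
    supported = (major, minor) >= (8, 0)
    print(f"GPU 0 compute capability {major}.{minor}: "
          + ("supported" if supported else "below ScaleLLM's requirement"))
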
Health Check

  • Last Commit: 3 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 4
  • Issues (30d): 0

Star History

  • 7 stars in the last 30 days

Explore Similar Projects

Starred by Ying Sheng (Coauthor of SGLang) and Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake).

llm-analysis by cli99

0.4%
455 stars
CLI tool for LLM latency/memory analysis during training/inference
Created 2 years ago
Updated 5 months ago
Starred by Wing Lian (Founder of Axolotl AI) and Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems").

airllm by lyogavin

0.1%
6k stars
Inference optimization for LLMs on low-resource hardware
Created 2 years ago
Updated 2 weeks ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Clement Delangue (Cofounder of Hugging Face), and 58 more.

vllm by vllm-project

1.1%
58k stars
LLM serving engine for high-throughput, memory-efficient inference
Created 2 years ago
Updated 11 hours ago