ScaleLLM by vectorch-ai

LLM inference system for production environments

Created 2 years ago
496 stars

Top 62.4% on SourcePulse

Project Summary

ScaleLLM is a high-performance inference system for large language models (LLMs) designed for production environments. It offers an OpenAI-compatible API and supports a wide range of popular open-source models, aiming to provide efficient and customizable LLM deployment.

How It Works

ScaleLLM leverages state-of-the-art techniques including Flash Attention, Paged Attention, and continuous batching for high-efficiency inference. It also supports tensor parallelism for distributed model execution. Advanced features like CUDA graphs, prefix caching, chunked prefill, and speculative decoding are integrated to further optimize performance and reduce latency.
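As a toy illustration of the continuous-batching idea, the sketch below (a hypothetical simulation, not ScaleLLM's actual scheduler) admits queued requests into the running batch as soon as a finished sequence frees a slot, instead of waiting for the whole batch to drain:

```python
from collections import deque

def continuous_batching(requests, max_batch=4):
    """Toy continuous-batching loop. Each request is (request_id,
    tokens_to_generate); every step, each running request emits one
    token, and freed slots are refilled immediately from the queue."""
    waiting = deque(requests)
    running = {}                       # request_id -> tokens remaining
    steps = 0
    while waiting or running:
        # Admit new requests into any free slots before each decode step.
        while waiting and len(running) < max_batch:
            rid, n = waiting.popleft()
            running[rid] = n
        # One "decode step": every running request generates one token.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]       # slot freed for the next request
        steps += 1
    return steps

# Short requests finish early and free slots for queued ones, so the
# total step count is far below sequential execution.
print(continuous_batching([("a", 5), ("b", 2), ("c", 2), ("d", 5), ("e", 3)]))
```

With static batching, the fifth request would have to wait for the entire first batch to finish; here it starts as soon as the first short sequence completes.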

Quick Start & Requirements

  • Installation: pip install -U scalellm
  • CUDA/PyTorch Versions: Supports custom installations via index URLs (e.g., pip install -U scalellm -i https://whl.vectorch.com/cu121/torch2.4.1/).
  • Prerequisites: Requires NVIDIA GPUs newer than the Turing architecture.
  • Demo: An OpenAI-compatible REST API server can be started with python3 -m scalellm.serve.api_server --model=<model_name>. A chatbot UI is available via Docker.
  • Docs: https://docs.vectorch.com/
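Once the server is running, it speaks the OpenAI wire format. The sketch below builds a request body for the standard /v1/chat/completions endpoint; the host, port, and model name are illustrative assumptions, not values taken from ScaleLLM's docs:

```python
import json

def build_chat_request(model, user_message, host="http://localhost:8080"):
    """Hypothetical helper: assembles the URL and JSON body for an
    OpenAI-compatible chat completion request (fields follow the
    OpenAI API shape)."""
    url = f"{host}/v1/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "stream": False,
    }
    return url, json.dumps(payload)

url, body = build_chat_request("meta-llama/Llama-2-7b-chat-hf", "Hello!")
print(url)
```

Any OpenAI-compatible client (including the official `openai` Python package pointed at this base URL) can be used instead of hand-building requests.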

Highlighted Details

  • Supports a broad range of LLMs including Llama, Gemma, Mistral, Qwen, and more.
  • Offers both chat and completion APIs.
  • Integrates GPTQ and AWQ quantization methods.
  • Features CUDA graph support for reduced kernel launch overhead.
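For intuition about what GPTQ/AWQ-style weight quantization buys, here is a conceptual sketch of symmetric 4-bit quantization in pure Python: weights collapse to 16 integer levels plus a float scale. This is an illustration of the general technique only, not ScaleLLM's kernels or either algorithm's calibration procedure:

```python
def quantize_int4(weights):
    """Symmetric 4-bit quantization: map floats to integer levels in
    [-8, 7] with a single shared scale (per-group in real systems)."""
    scale = max(abs(w) for w in weights) / 7 or 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from levels and scale."""
    return [v * scale for v in q]

w = [0.12, -0.53, 0.98, -0.07]
q, s = quantize_int4(w)
w_hat = dequantize(q, s)
# Reconstruction error is bounded by half a quantization step.
err = max(abs(a - b) for a, b in zip(w, w_hat))
print(q, round(err, 4))
```

Real GPTQ and AWQ additionally use calibration data to choose scales (and, in GPTQ's case, to compensate rounding error weight-by-weight), which is what keeps accuracy close to the fp16 baseline.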

Maintenance & Community

  • Active development with a public roadmap available on GitHub Issues.
  • Community support via Discord: https://discord.gg/PKe5gvBZfn and GitHub Discussions.

Licensing & Compatibility

  • Licensed under the Apache 2.0 license.
  • Compatible with commercial use and closed-source linking.

Limitations & Caveats

  • Currently only supports GPUs newer than the Turing architecture.
Health Check

  • Last Commit: 3 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 5 stars in the last 30 days

Explore Similar Projects

Starred by Ying Sheng (Coauthor of SGLang) and Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake).

llm-analysis by cli99

  • 485 stars
  • CLI tool for LLM latency/memory analysis during training/inference
  • Created 2 years ago; updated 11 months ago
  • Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Clement Delangue (Cofounder of Hugging Face), and 61 more.

vllm by vllm-project

  • 76k stars
  • LLM serving engine for high-throughput, memory-efficient inference
  • Created 3 years ago; updated 20 hours ago