ScaleLLM by vectorch-ai

LLM inference system for production environments

created 2 years ago
460 stars

Top 66.8% on sourcepulse

Project Summary

ScaleLLM is a high-performance inference system for large language models (LLMs) designed for production environments. It offers an OpenAI-compatible API and supports a wide range of popular open-source models, aiming to provide efficient and customizable LLM deployment.

How It Works

ScaleLLM leverages state-of-the-art techniques including Flash Attention, Paged Attention, and continuous batching for high-efficiency inference. It also supports tensor parallelism for distributed model execution. Advanced features like CUDA graphs, prefix caching, chunked prefill, and speculative decoding are integrated to further optimize performance and reduce latency.
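To make Paged Attention concrete, here is a minimal, illustrative sketch (not ScaleLLM's actual implementation) of the bookkeeping it relies on: each sequence's KV cache lives in fixed-size blocks allocated on demand, so memory is not reserved up front for the maximum sequence length. The class names and block size below are assumptions for illustration only.

```python
# Illustrative sketch of paged KV-cache bookkeeping (hypothetical names).
BLOCK_SIZE = 16  # tokens per KV-cache block; a typical default, assumed here

class BlockAllocator:
    """Hands out physical KV-cache blocks from a fixed pool."""
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        return self.free_blocks.pop()

    def free(self, block: int) -> None:
        self.free_blocks.append(block)

class Sequence:
    """Tracks which physical blocks hold this sequence's KV cache."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []  # logical -> physical block map
        self.num_tokens = 0

    def append_token(self) -> None:
        # A new block is allocated only when the current one fills up.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

allocator = BlockAllocator(num_blocks=64)
seq = Sequence(allocator)
for _ in range(40):          # 40 tokens -> ceil(40 / 16) = 3 blocks
    seq.append_token()
print(len(seq.block_table))  # 3
```

Because blocks are uniform and freed back to a shared pool, many sequences with different lengths can share GPU memory tightly, which is what makes continuous batching effective.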

Quick Start & Requirements

  • Installation: pip install -U scalellm
  • CUDA/PyTorch Versions: Supports custom installations via index URLs (e.g., pip install -U scalellm -i https://whl.vectorch.com/cu121/torch2.4.1/).
  • Prerequisites: Requires NVIDIA GPUs newer than the Turing architecture.
  • Demo: An OpenAI-compatible REST API server can be started with python3 -m scalellm.serve.api_server --model=<model_name>. A chatbot UI is available via Docker.
  • Docs: https://docs.vectorch.com/
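Once the API server is running, any OpenAI-compatible client can talk to it. Below is a minimal sketch using only the Python standard library; the host, port, and model name are assumptions, so point them at wherever you started `scalellm.serve.api_server`.

```python
# Minimal client for an OpenAI-compatible chat endpoint (stdlib only).
# base_url, port, and model name below are placeholders/assumptions.
import json
import urllib.request

def build_chat_request(model: str, prompt: str) -> dict:
    """Build an OpenAI-style /v1/chat/completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }

def chat(base_url: str, model: str, prompt: str) -> str:
    payload = json.dumps(build_chat_request(model, prompt)).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    # Assumes a server started with, e.g.:
    #   python3 -m scalellm.serve.api_server --model=<model_name>
    print(chat("http://localhost:8080", "<model_name>", "Say hello."))
```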

Highlighted Details

  • Supports a broad range of LLMs including Llama, Gemma, Mistral, Qwen, and more.
  • Offers both chat and completion APIs.
  • Integrates GPTQ and AWQ quantization methods.
  • Features CUDA graph support for reduced kernel launch overhead.
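The chat and completion APIs follow the OpenAI conventions, differing mainly in request shape: chat takes structured messages with roles, while completion takes a raw text prompt. A sketch of the two payloads (endpoint paths per the OpenAI convention; the model name is a placeholder):

```python
# Sketch of the two OpenAI-style request shapes (model name is a placeholder).

def chat_request(model: str, user_msg: str) -> tuple[str, dict]:
    """Chat API: a list of role-tagged messages."""
    return "/v1/chat/completions", {
        "model": model,
        "messages": [{"role": "user", "content": user_msg}],
    }

def completion_request(model: str, prompt: str) -> tuple[str, dict]:
    """Completion API: a raw text prompt."""
    return "/v1/completions", {
        "model": model,
        "prompt": prompt,
    }
```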

Maintenance & Community

  • Active development with a public roadmap available on GitHub Issues.
  • Community support via Discord: https://discord.gg/PKe5gvBZfn and GitHub Discussions.

Licensing & Compatibility

  • Licensed under the Apache 2.0 license.
  • Compatible with commercial use and closed-source linking.

Limitations & Caveats

  • Currently only supports GPUs newer than the Turing architecture.

Health Check

  • Last commit: 1 week ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 15
  • Issues (30d): 0
  • Star History: 27 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 5 more.

TensorRT-LLM by NVIDIA

LLM inference optimization SDK for NVIDIA GPUs
Top 0.6% | 11k stars | created 1 year ago | updated 13 hours ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Tobi Lutke (Cofounder of Shopify), and 27 more.

vllm by vllm-project

LLM serving engine for high-throughput, memory-efficient inference
Top 1.0% | 54k stars | created 2 years ago | updated 9 hours ago