Discover and explore top open-source AI tools and projects—updated daily.
vectorch-ai/ScaleLLM: LLM inference system for production environments
Top 63.7% on SourcePulse
ScaleLLM is a high-performance inference system for large language models (LLMs) designed for production environments. It offers an OpenAI-compatible API and supports a wide range of popular open-source models, aiming to provide efficient and customizable LLM deployment.
How It Works
ScaleLLM leverages state-of-the-art techniques including Flash Attention, Paged Attention, and continuous batching for high-efficiency inference. It also supports tensor parallelism for distributed model execution. Advanced features like CUDA graphs, prefix caching, chunked prefill, and speculative decoding are integrated to further optimize performance and reduce latency.
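Continuous batching is one of the core techniques mentioned above: instead of waiting for an entire batch to finish, new requests join the running batch as soon as slots free up, and finished sequences are retired immediately. A minimal illustrative sketch of that scheduling idea (not ScaleLLM's actual implementation; the request structure and `step_fn` are assumptions for illustration):

```python
from collections import deque

def make_request(rid, tokens_left):
    # Hypothetical request record: id, tokens still to decode, done flag.
    return {"id": rid, "left": tokens_left, "done": False}

def step(req):
    # One decode step for a single sequence (stand-in for a model forward pass).
    req["left"] -= 1
    if req["left"] == 0:
        req["done"] = True

def continuous_batching(requests, max_batch, step_fn):
    """Illustrative scheduler: admit new requests whenever a slot is free,
    run one decode step per active sequence, retire finished ones at once."""
    pending = deque(requests)
    running = []
    completed = []
    while pending or running:
        # Fill free slots immediately (the "continuous" part).
        while pending and len(running) < max_batch:
            running.append(pending.popleft())
        # One decode step for every sequence in the batch.
        for req in running:
            step_fn(req)
        # Retire finished sequences so their slots open up next iteration.
        completed += [r for r in running if r["done"]]
        running = [r for r in running if not r["done"]]
    return completed
```

Short sequences finish and free their slots without blocking on the longest request in the batch, which is what keeps GPU utilization high under mixed-length workloads.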
Quick Start & Requirements
Install the latest release:

    pip install -U scalellm

Or install a build matched to a specific CUDA/PyTorch combination from the project's wheel index (e.g. CUDA 12.1 with PyTorch 2.4.1):

    pip install -U scalellm -i https://whl.vectorch.com/cu121/torch2.4.1/

Start the API server:

    python3 -m scalellm.serve.api_server --model=<model_name>

A chatbot UI is available via Docker.
Highlighted Details
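Since the server exposes an OpenAI-compatible API, any OpenAI-style client can talk to it. A minimal sketch using only the Python standard library; the base URL (localhost port 8080) is an assumption, so check the server's startup log for the actual address, and substitute your model name for the placeholder:

```python
import json
from urllib import request

def build_chat_request(prompt, model="<model_name>"):
    # OpenAI-style chat completion payload.
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def chat(prompt, base="http://localhost:8080/v1"):
    """POST one chat completion to an OpenAI-compatible endpoint
    and return the assistant's reply text."""
    body = json.dumps(build_chat_request(prompt)).encode()
    req = request.Request(
        base + "/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]
```

The same payload works with the official `openai` client by pointing its `base_url` at the local server.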
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats