1CatAI: LLM inference acceleration for legacy GPUs
Top 87.1% on SourcePulse
This project is a specialized fork of vLLM for running large language models on older Tesla V100 (SM70) GPUs. It targets developers and teams who want to use existing V100 hardware for modern AI workloads, and it reports improvements in inference speed, context handling, and deployment stability for AWQ-quantized models. The main benefit is extending the useful life of V100 hardware by making previously difficult LLM deployments feasible and efficient.
How It Works
1Cat-vLLM adapts vLLM to the SM70 architecture by integrating the lmdeploy TurboMind SM70 WMMA kernels and the FLASH_ATTN_V100 attention backend, both of which target Volta GPUs. Upstream vLLM AWQ kernels require SM75 or newer; these replacements let V100s serve modern AWQ-quantized models, including dense and Mixture-of-Experts (MoE) variants such as Qwen3.5/3.6, with improved long-context stability and faster inference.
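As a rough illustration of the capability split described above, the following sketch classifies a GPU by CUDA compute capability. It is not the fork's actual dispatch logic: the names `SM70`, `SM75`, `awq_path`, and `detect_capability` are hypothetical, and the only thresholds used are the two the text mentions (SM70 for the V100, SM75+ for upstream AWQ kernels). Detection uses PyTorch's `torch.cuda.get_device_capability` when PyTorch is installed.

```python
from typing import Optional, Tuple

# Thresholds taken from the text above; everything else here is illustrative.
SM70 = (7, 0)  # Volta, e.g. Tesla V100
SM75 = (7, 5)  # minimum the text attributes to upstream vLLM AWQ kernels


def awq_path(capability: Tuple[int, int]) -> str:
    """Classify a (major, minor) compute capability under the stated assumptions."""
    if capability >= SM75:
        return "upstream"      # upstream AWQ kernels would apply
    if capability == SM70:
        return "sm70-fork"     # the case this fork is built for
    return "unsupported"       # not covered by either path in this sketch


def detect_capability() -> Optional[Tuple[int, int]]:
    """Return the capability of GPU 0, or None if PyTorch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    major, minor = torch.cuda.get_device_capability(0)
    return (major, minor)


if __name__ == "__main__":
    cap = detect_capability()
    if cap is None:
        print("no CUDA device detected")
    else:
        print(f"sm_{cap[0]}{cap[1]}: {awq_path(cap)}")
```

Keeping the classification separate from the detection makes the rule easy to check on a machine without a GPU.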
Quick Start & Requirements
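Before running the install commands in this section, it may help to confirm that the local interpreter matches the tags in the two release wheel filenames (`cp312`, `linux_x86_64`). The sketch below is a simplified exact-match check, not pip's full compatibility logic. The helper names `parse_wheel_tags`, `current_tags`, and `wheel_matches` are illustrative and not part of the project.

```python
import platform
import sys

WHEELS = [
    "flash_attn_v100-1.0.0-cp312-cp312-linux_x86_64.whl",
    "vllm-1.0.0-cp312-cp312-linux_x86_64.whl",
]


def parse_wheel_tags(filename):
    """Return (python_tag, abi_tag, platform_tag) from a wheel filename or URL."""
    name = filename.rsplit("/", 1)[-1]
    if not name.endswith(".whl"):
        raise ValueError(f"not a wheel: {name}")
    parts = name[: -len(".whl")].split("-")
    if len(parts) < 5:
        raise ValueError(f"unexpected wheel name: {name}")
    # The last three dash-separated fields are always the tags.
    return tuple(parts[-3:])


def current_tags():
    """Approximate this interpreter's (python_tag, platform_tag)."""
    impl = "cp" if sys.implementation.name == "cpython" else sys.implementation.name
    python_tag = f"{impl}{sys.version_info.major}{sys.version_info.minor}"
    # Yields "linux_x86_64" on 64-bit x86 Linux; other values simply will not match.
    platform_tag = f"{sys.platform}_{platform.machine()}"
    return python_tag, platform_tag


def wheel_matches(filename, python_tag, platform_tag):
    """Exact-match comparison of a wheel's tags against the given tags."""
    wheel_python, _abi, wheel_platform = parse_wheel_tags(filename)
    return wheel_python == python_tag and wheel_platform == platform_tag


if __name__ == "__main__":
    py_tag, plat_tag = current_tags()
    for wheel in WHEELS:
        status = "ok" if wheel_matches(wheel, py_tag, plat_tag) else "mismatch"
        print(f"{wheel}: {status} (interpreter: {py_tag}, {plat_tag})")
```

A mismatch here usually means pip would reject the wheel, so creating a Python 3.12 environment on x86_64 Linux first avoids a failed install.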
python -m pip install --prefer-binary --no-cache-dir \
--extra-index-url https://download.pytorch.org/whl/cu128 \
"https://github.com/1CatAI/1Cat-vLLM/releases/download/v1.0.0/flash_attn_v100-1.0.0-cp312-cp312-linux_x86_64.whl" \
"https://github.com/1CatAI/1Cat-vLLM/releases/download/v1.0.0/vllm-1.0.0-cp312-cp312-linux_x86_64.whl"
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
Highlighted Details
FLASH_ATTN_V100 backend for optimized attention mechanisms on Volta.
fp8_e5m2).
Maintenance & Community
The project is primarily driven by "一猫之下" (1CatAI), who contribute engineering experience and optimizations. Community discussion takes place in a WeChat group (1Cat-vLLM 开源交流群2, "1Cat-vLLM Open-Source Discussion Group 2").
Licensing & Compatibility
This repository adheres to the upstream vLLM license model. Compatibility is focused on Tesla V100 (SM70) hardware; performance on other architectures is not guaranteed. Commercial use is generally permitted under the terms of the vLLM license.
Limitations & Caveats
This fork is highly specialized for SM70 / Tesla V100 hardware and is not intended as a general-purpose vLLM replacement. Multimodal and vision workloads are not the default public profile and require separate tuning. Initial requests on V100 may experience significant compilation delays.
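One common way to keep the first-request compilation delay away from real users is to send a small warm-up request right after the server starts. The sketch below assumes the fork keeps upstream vLLM's OpenAI-compatible HTTP server and its `/v1/completions` route, which this page does not confirm. The base URL, model name, and the helpers `build_warmup_request` and `warm_up` are hypothetical.

```python
import json
import time
import urllib.request


def build_warmup_request(base_url, model, prompt="ping", max_tokens=1):
    """Build a minimal completion request (assumes an OpenAI-compatible server)."""
    body = json.dumps(
        {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    ).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def warm_up(base_url, model, timeout_s=1800.0):
    """Send one tiny request and return its latency in seconds.

    The generous timeout is an assumption meant to cover first-request
    compilation; tune it for the actual deployment.
    """
    request = build_warmup_request(base_url, model)
    start = time.perf_counter()
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        response.read()
    return time.perf_counter() - start


if __name__ == "__main__":
    # Hypothetical address and model name; replace with the real deployment values.
    seconds = warm_up("http://localhost:8000", "my-awq-model")
    print(f"warm-up request took {seconds:.1f} s")
```

Running `warm_up` twice and comparing the two latencies gives a quick estimate of how much of the first request was one-time compilation.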