1Cat-vLLM  by 1CatAI

LLM inference acceleration for legacy GPUs

Created 2 months ago
307 stars

Top 87.1% on SourcePulse

GitHubView on GitHub
Project Summary

This project provides a specialized fork of vLLM, engineered to significantly enhance the performance and usability of large language models on older Tesla V100 (SM70) GPUs. It targets developers and teams seeking to leverage existing V100 hardware for modern AI workloads, offering substantial improvements in inference speed, context handling, and deployment stability for AWQ-quantized models. The primary benefit is revitalizing V100 hardware, making previously challenging LLM deployments feasible and efficient.

How It Works

1Cat-vLLM achieves its performance gains through a systematic re-engineering of vLLM for SM70 architectures. It integrates lmdeploy TurboMind SM70 WMMA kernels and the FLASH_ATTN_V100 attention backend, specifically optimizing for Volta GPUs. This approach overcomes the SM75+ requirement of upstream vLLM AWQ kernels, enabling V100s to efficiently serve modern AWQ-quantized models, including dense and Mixture-of-Experts (MoE) variants like Qwen3.5/3.6, with improved long-context stability and faster inference.

Quick Start & Requirements

  • Primary install / run command: Recommended installation via prebuilt wheels:
    python -m pip install --prefer-binary --no-cache-dir \
      --extra-index-url https://download.pytorch.org/whl/cu128 \
      "https://github.com/1CatAI/1Cat-vLLM/releases/download/v1.0.0/flash_attn_v100-1.0.0-cp312-cp312-linux_x86_64.whl" \
      "https://github.com/1CatAI/1Cat-vLLM/releases/download/v1.0.0/vllm-1.0.0-cp312-cp312-linux_x86_64.whl"
    
  • Non-default prerequisites: Python 3.12, CUDA 12.8, PyTorch 2.9.1+cu128, NVIDIA Driver 570.211.01. Primarily validated on 4x Tesla V100 32 GB systems.
  • Estimated setup time: Docker builds involve multi-gigabyte downloads. First-time kernel compilation on V100 can take 1-3 minutes per request.
  • Links:
    • Release Wheels: GitHub Releases
    • CUDA 12.8 Install (Ubuntu): wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb

Highlighted Details

  • AWQ 4-bit inference support specifically for SM70 / Tesla V100 GPUs.
  • FLASH_ATTN_V100 backend for optimized attention mechanisms on Volta.
  • Comprehensive support for Qwen3.5/Qwen3.6 models, including dense, MoE, and MTP (Multi-Turn Prompting) speculative decoding.
  • Experimental FP8 KV cache support (fp8_e5m2).
  • OpenAI-compatible API serving endpoint.
  • SM70-specific runtime fixes for stability and performance.

Maintenance & Community

The project is primarily driven by "一猫之下" (1CatAI), contributing engineering experience and optimizations. Community interaction is facilitated via a WeChat group (1Cat-vLLM 开源交流群2).

Licensing & Compatibility

This repository adheres to the upstream vLLM license model. Compatibility is focused on Tesla V100 (SM70) hardware; performance on other architectures is not guaranteed. Commercial use is generally permitted under the terms of the vLLM license.

Limitations & Caveats

This fork is highly specialized for SM70 / Tesla V100 hardware and is not intended as a general-purpose vLLM replacement. Multimodal and vision workloads are not the default public profile and require separate tuning. Initial requests on V100 may experience significant compilation delays.

Health Check
Last Commit

4 days ago

Responsiveness

Inactive

Pull Requests (30d)
6
Issues (30d)
22
Star History
128 stars in the last 30 days

Explore Similar Projects

Starred by Wing Lian Wing Lian(Founder of Axolotl AI) and Chip Huyen Chip Huyen(Author of "AI Engineering", "Designing Machine Learning Systems").

airllm by lyogavin

1.1%
18k
Inference optimization for LLMs on low-resource hardware
Created 2 years ago
Updated 2 months ago
Feedback? Help us improve.