1Cat-vLLM by 1CatAI

LLM inference acceleration for legacy GPUs

Created 4 months ago

498 stars

Top 61.6% on SourcePulse

Project Summary

This project provides a specialized fork of vLLM, engineered to significantly enhance the performance and usability of large language models on older Tesla V100 (SM70) GPUs. It targets developers and teams seeking to leverage existing V100 hardware for modern AI workloads, offering substantial improvements in inference speed, context handling, and deployment stability for AWQ-quantized models. The primary benefit is revitalizing V100 hardware, making previously challenging LLM deployments feasible and efficient.

How It Works

1Cat-vLLM achieves its performance gains through a systematic re-engineering of vLLM for SM70 architectures. It integrates lmdeploy TurboMind SM70 WMMA kernels and the FLASH_ATTN_V100 attention backend, specifically optimizing for Volta GPUs. This approach overcomes the SM75+ requirement of upstream vLLM AWQ kernels, enabling V100s to efficiently serve modern AWQ-quantized models, including dense and Mixture-of-Experts (MoE) variants like Qwen3.5/3.6, with improved long-context stability and faster inference.

Quick Start & Requirements

Primary install / run command: Recommended installation via prebuilt wheels:

python -m pip install --prefer-binary --no-cache-dir \
  --extra-index-url https://download.pytorch.org/whl/cu128 \
  "https://github.com/1CatAI/1Cat-vLLM/releases/download/v1.0.0/flash_attn_v100-1.0.0-cp312-cp312-linux_x86_64.whl" \
  "https://github.com/1CatAI/1Cat-vLLM/releases/download/v1.0.0/vllm-1.0.0-cp312-cp312-linux_x86_64.whl"

Non-default prerequisites: Python 3.12, CUDA 12.8, PyTorch 2.9.1+cu128, NVIDIA Driver 570.211.01. Primarily validated on 4x Tesla V100 32 GB systems.
Estimated setup time: Docker builds involve multi-gigabyte downloads. First-time kernel compilation on V100 can take 1-3 minutes per request.
Links:
- Release Wheels: GitHub Releases
- CUDA 12.8 Install (Ubuntu): wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb

Highlighted Details

AWQ 4-bit inference support specifically for SM70 / Tesla V100 GPUs.
FLASH_ATTN_V100 backend for optimized attention mechanisms on Volta.
Comprehensive support for Qwen3.5/Qwen3.6 models, including dense, MoE, and MTP (Multi-Turn Prompting) speculative decoding.
Experimental FP8 KV cache support (fp8_e5m2).
OpenAI-compatible API serving endpoint.
SM70-specific runtime fixes for stability and performance.

Maintenance & Community

The project is primarily driven by "一猫之下" (1CatAI), contributing engineering experience and optimizations. Community interaction is facilitated via a WeChat group (1Cat-vLLM 开源交流群2).

Licensing & Compatibility

This repository adheres to the upstream vLLM license model. Compatibility is focused on Tesla V100 (SM70) hardware; performance on other architectures is not guaranteed. Commercial use is generally permitted under the terms of the vLLM license.

Limitations & Caveats

This fork is highly specialized for SM70 / Tesla V100 hardware and is not intended as a general-purpose vLLM replacement. Multimodal and vision workloads are not the default public profile and require separate tuning. Initial requests on V100 may experience significant compilation delays.

1Cat-vLLM by 1CatAI

Explore Similar Projects

kaiwu by val1813

ntransformer by xaskasdf

DGX_Spark_Qwen3.5-122B-A10B-AR-INT4 by albond

llama.cpp-deepseek-v4-flash by antirez

omniserve by mit-han-lab

buun-llama-cpp by spiritbuun

OSCAR by FutureMLS-Lab

atlas by Avarok-Cybersecurity

GPTQModel by ModelCloud

picolm by RightNow-AI

ollm by Mega4alik

airllm by lyogavin