qwen-vllm by owenliang

Qwen LLM inference deployment demo

created 1 year ago
592 stars

Top 55.7% on sourcepulse

View on GitHub
Project Summary

This repository provides a demonstration for deploying and performing inference with the Qwen Large Language Model (LLM) using the vLLM inference engine. It targets developers and researchers looking to set up high-throughput, concurrent LLM serving in production environments, offering both offline and online inference capabilities with streaming responses.

How It Works

The project leverages vLLM's continuous batching mechanism for efficient, high-concurrency inference. For online serving, it utilizes an asyncio-based HTTP server built with FastAPI and Uvicorn. This setup exposes HTTP endpoints, queues incoming requests for vLLM's batch processing, and asynchronously returns results, supporting streamed output via FastAPI's chunked responses. The client-side implementation uses the requests library to consume and display these streamed tokens.
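
The repository's actual server script is not reproduced here, but a minimal sketch of this pattern may help make it concrete. The model path, endpoint name, port, and sampling settings below are illustrative assumptions, not values taken from the repository; the vLLM calls follow the AsyncLLMEngine API of the vLLM versions contemporary with this project.

```python
# Hypothetical sketch: asyncio streaming server built on vLLM's AsyncLLMEngine.
# Model path, endpoint, port, and sampling settings are assumptions for illustration.
import uuid

import uvicorn
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse
from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

app = FastAPI()
engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(model="Qwen/Qwen-7B-Chat-Int4", trust_remote_code=True)
)

@app.post("/generate")
async def generate(request: Request):
    payload = await request.json()
    sampling = SamplingParams(temperature=0.7, max_tokens=512)
    # engine.generate() returns an async generator; vLLM batches concurrent
    # requests internally (continuous batching) across all open generators.
    results = engine.generate(payload["prompt"], sampling, str(uuid.uuid4()))

    async def stream():
        sent = 0
        async for output in results:
            text = output.outputs[0].text
            # Emit only the newly generated suffix so the client can print chunks directly.
            yield text[sent:]
            sent = len(text)

    # StreamingResponse sends each yielded piece as a chunked HTTP response.
    return StreamingResponse(stream(), media_type="text/plain")

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
```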

Quick Start & Requirements

  • Offline Inference: python vllm_offline.py
  • Online Server: python vllm_server.py
  • Online Client: python vllm_client.py (see the streaming-client sketch after this list)
  • WebUI: python gradio_webui.py (after starting the server)
  • Prerequisites: Python 3.10, CUDA 12.1, PyTorch 2.1.
  • Dependencies: install vLLM-GPTQ from source with pip install . -i https://mirrors.aliyun.com/pypi/simple/ (using the Aliyun PyPI mirror), then pip install modelscope and pip install tiktoken.
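
On the client side, consuming the chunked response with the requests library might look like the following minimal sketch. The endpoint URL and JSON payload shape mirror the hypothetical server sketch above and are not taken from the repository's vllm_client.py.

```python
# Hypothetical streaming client; URL and payload shape match the server sketch above.
import requests

response = requests.post(
    "http://localhost:8000/generate",
    json={"prompt": "Introduce the Qwen model in one sentence."},
    stream=True,  # keep the connection open and read chunks as they arrive
)
response.raise_for_status()

# decode_unicode=True uses an incremental decoder, so multi-byte characters
# split across chunk boundaries are handled correctly.
for chunk in response.iter_content(chunk_size=None, decode_unicode=True):
    print(chunk, end="", flush=True)
print()
```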

Highlighted Details

  • Demonstrates high-concurrency inference using vLLM's continuous batching.
  • Supports streaming responses for both server and client.
  • Includes a Gradio-based WebUI for interactive chat.
  • Provides examples of Qwen prompt formatting for pre-trained and chat models (a sketch of the chat-model prompt layout follows this list).
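
Qwen chat models expect ChatML-style prompts delimited by <|im_start|> and <|im_end|> tokens. The helper below is a minimal sketch of how such a prompt might be assembled; the function name and default system prompt are illustrative assumptions, not the repository's code.

```python
# Hypothetical helper that assembles a ChatML-style prompt for a Qwen chat model.
def build_chat_prompt(user_message, history=None, system="You are a helpful assistant."):
    history = history or []  # list of (user, assistant) turns
    prompt = f"<|im_start|>system\n{system}<|im_end|>\n"
    for user_turn, assistant_turn in history:
        prompt += f"<|im_start|>user\n{user_turn}<|im_end|>\n"
        prompt += f"<|im_start|>assistant\n{assistant_turn}<|im_end|>\n"
    # Leave the final assistant turn open so the model continues from here.
    prompt += f"<|im_start|>user\n{user_message}<|im_end|>\n<|im_start|>assistant\n"
    return prompt

print(build_chat_prompt("What is vLLM?"))
```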

Maintenance & Community

No specific information on contributors, sponsorships, or community channels (like Discord/Slack) is provided in the README.

Licensing & Compatibility

The README does not explicitly state the license for this repository. It references official vLLM and Qwen implementations, which may have their own licenses. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project focuses on the core functionality needed for production serving; the author notes that peripheral details were not a priority. Specific hardware requirements beyond CUDA 12.1 are not detailed, and the licensing status requires clarification before commercial adoption.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 23 stars in the last 90 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Tobi Lutke (Cofounder of Shopify), and 27 more.

vllm by vllm-project
Top 1.0% on sourcepulse · 54k stars
LLM serving engine for high-throughput, memory-efficient inference
created 2 years ago · updated 4 hours ago