Qwen LLM inference deployment demo
This repository demonstrates how to deploy and run inference with the Qwen Large Language Model (LLM) using the vLLM inference engine. It targets developers and researchers who need high-throughput, concurrent LLM serving in production environments, and covers both offline and online inference with streaming responses.
How It Works
The project leverages vLLM's continuous batching mechanism for efficient, high-concurrency inference. For online serving, it uses an asyncio-based HTTP server built with FastAPI and Uvicorn: the server exposes HTTP endpoints, queues incoming requests for vLLM's batch processing, and returns results asynchronously, with streamed output delivered via FastAPI's chunked responses. The client side uses the requests library to consume and display the streamed tokens.
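As a concrete illustration, the sketch below shows a streaming endpoint in this style, built on FastAPI, Uvicorn, and vLLM's async engine. The model name, the /generate route, and the port are assumptions made for the example; the repository's actual vllm_server.py may be organized differently.

```python
# Sketch of a streaming vLLM server (model path, route, and port are assumed).
import json
import uuid

import uvicorn
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse
from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

app = FastAPI()
# The async engine queues incoming requests and serves them with continuous batching.
engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(model="Qwen/Qwen-7B-Chat", trust_remote_code=True)
)

@app.post("/generate")
async def generate(request: Request) -> StreamingResponse:
    body = await request.json()
    params = SamplingParams(temperature=0.8, max_tokens=512)
    request_id = str(uuid.uuid4())

    async def stream():
        # vLLM yields cumulative outputs; forward the latest text as JSON lines.
        async for output in engine.generate(body["prompt"], params, request_id):
            yield (json.dumps({"text": output.outputs[0].text}) + "\n").encode()

    return StreamingResponse(stream(), media_type="text/plain")

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
```

A matching client can consume the chunked response incrementally with requests, along these lines:

```python
# Sketch of a streaming client for the endpoint assumed above.
import json
import requests

resp = requests.post(
    "http://localhost:8000/generate",
    json={"prompt": "Introduce the Qwen model."},
    stream=True,  # read the chunked response as it arrives
)
for line in resp.iter_lines():
    if line:
        print(json.loads(line)["text"])
```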
Quick Start & Requirements
Requirements: pip install . -i https://mirrors.aliyun.com/pypi/simple/ (for vLLM GPTQ), pip install modelscope, and pip install tiktoken.

Run the demos:
python vllm_offline.py (offline inference; see the sketch below)
python vllm_server.py (starts the online server)
python vllm_client.py (after starting the server)
python gradio_webui.py (after starting the server)
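For orientation, here is a minimal sketch of what offline batch inference with vLLM looks like; the model name and sampling settings are assumptions, and the repository's vllm_offline.py may differ.

```python
# Minimal sketch of offline batch inference with vLLM (assumed model and settings).
from vllm import LLM, SamplingParams

# trust_remote_code lets the Qwen tokenizer (tiktoken-based) load its custom code.
llm = LLM(model="Qwen/Qwen-7B-Chat", trust_remote_code=True)
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=256)

prompts = ["Introduce the Qwen model.", "Explain continuous batching in one sentence."]
outputs = llm.generate(prompts, params)  # vLLM batches the prompts internally
for out in outputs:
    print(out.outputs[0].text)
```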
Highlighted Details
Maintenance & Community
No specific information on contributors, sponsorships, or community channels (like Discord/Slack) is provided in the README.
Licensing & Compatibility
The README does not explicitly state the license for this repository. It references official vLLM and Qwen implementations, which may have their own licenses. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The project focuses on core functionality for production serving and notes that "edge details" were not heavily invested in. Specific hardware requirements beyond CUDA 12.1 are not detailed, and the project's licensing status requires clarification for commercial adoption.