Emericen: Minimal multimodal AI model re-implementation
Top 88.9% on SourcePulse
A minimal, easy-to-read PyTorch re-implementation of the Qwen3 VL model, targeting engineers and researchers seeking a clear, foundational understanding or a flexible base for multimodal AI projects. It simplifies access to Qwen3 VL's text and vision capabilities, providing a streamlined alternative to official implementations.
How It Works
The project reconstructs Qwen3 VL's architecture in PyTorch with an emphasis on code readability and minimal dependencies. It supports both text and vision inputs, processing them through a model that can utilize dense or Mixture of Experts (MoE) configurations. This design choice facilitates easier comprehension of the underlying mechanisms and allows for straightforward experimentation with multimodal transformer models.
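The dense-versus-Mixture-of-Experts distinction mentioned above can be made concrete with a toy top-k routed feed-forward layer. This is a generic illustrative sketch of MoE routing under assumed dimensions, not code from the repository; the class and parameter names here are invented:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    """Toy top-k Mixture-of-Experts feed-forward layer (illustrative only)."""

    def __init__(self, dim: int, num_experts: int = 4, top_k: int = 2, hidden: int = 32):
        super().__init__()
        self.top_k = top_k
        # Router scores each token against every expert.
        self.router = nn.Linear(dim, num_experts)
        # Each expert is a small independent feed-forward network.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, dim)
        logits = self.router(x)                           # (tokens, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)    # pick top-k experts per token
        weights = F.softmax(weights, dim=-1)              # normalize over selected experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                  # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out
```

A dense configuration corresponds to routing every token through a single shared feed-forward block instead of a routed set of experts.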
Quick Start & Requirements
- Install uv (pip install uv), create and activate a virtual environment (uv venv, then source .venv/bin/activate), and install project dependencies with uv pip install -r requirements.txt.
- Requires huggingface_hub for downloading model weights and PIL for image handling. CUDA is recommended for optimal performance.
- Run with python run.py. Images can be referenced within prompts using the @relative/path/to/image.jpg syntax.

Highlighted Details
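As an illustration of the @relative/path/to/image.jpg prompt convention from the Quick Start, a minimal parser might split a prompt into text and image segments before they reach the model. This is a hypothetical sketch; the regex, function name, and segment format are assumptions, not the project's actual prompt handling:

```python
import re
from pathlib import Path

# Matches @-prefixed image paths such as "@photos/cat.jpg" inside a prompt.
IMAGE_REF = re.compile(r"@(\S+\.(?:jpg|jpeg|png|webp))", re.IGNORECASE)

def split_prompt(prompt: str):
    """Return a list of ('text', str) and ('image', Path) segments, in order."""
    segments, last = [], 0
    for m in IMAGE_REF.finditer(prompt):
        if m.start() > last:
            segments.append(("text", prompt[last:m.start()]))
        segments.append(("image", Path(m.group(1))))
        last = m.end()
    if last < len(prompt):
        segments.append(("text", prompt[last:]))
    return segments
```

For example, split_prompt("Describe @photos/cat.jpg briefly") would yield a text segment, an image segment pointing at photos/cat.jpg, and a trailing text segment.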
- Generation is exposed through both model.generate and model.generate_stream.

Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
Last updated 1 month ago · Status: Inactive · Tags: huggingface