Kwai-Keye: Multimodal LLM for video and image understanding
Top 48.3% on SourcePulse
Kwai Keye-VL is a multimodal large language model designed for advanced video understanding, visual perception, and reasoning tasks. It targets researchers and developers seeking state-of-the-art performance in processing complex visual and textual data, offering significant improvements over comparable models in video comprehension and logical problem-solving.
How It Works
Keye-VL builds upon the Qwen3-8B architecture, integrating a SigLIP vision encoder. It employs 3D RoPE for unified text, image, and video processing, enabling precise temporal perception. Images are handled via a 14x14 patch sequence with dynamic resolution and aspect ratio preservation, mapped by an MLP. The model's training involves a four-stage progressive strategy for pre-training and a two-phase, five-stage approach for post-training, emphasizing Chain of Thought (CoT) reasoning and reinforcement learning for complex cognitive tasks.
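The 14x14 patching with dynamic resolution can be illustrated with a small sketch. This is not the official Keye-VL implementation; it assumes Qwen2-VL-style behavior where each image side is rounded to a whole number of patches while the aspect ratio is preserved, and the resulting patch embeddings are mapped into the language model by an MLP.

```python
# Illustrative sketch of 14x14 patch tokenization with dynamic resolution.
# NOT the official Keye-VL code; the rounding rule is an assumption.

PATCH = 14  # ViT patch edge used by the SigLIP-style encoder described above

def patch_grid(width: int, height: int, patch: int = PATCH) -> tuple[int, int]:
    """Round each side to the nearest whole number of patches (at least one),
    so the aspect ratio is approximately preserved at any input resolution."""
    grid_w = max(1, round(width / patch))
    grid_h = max(1, round(height / patch))
    return grid_w, grid_h

def num_patches(width: int, height: int) -> int:
    """Total patch embeddings produced for one image."""
    grid_w, grid_h = patch_grid(width, height)
    return grid_w * grid_h

# A 224x224 image yields a 16x16 grid -> 256 patch embeddings.
print(num_patches(224, 224))  # -> 256
```

Because the grid adapts to the input size, a wide 336x224 image produces a 24x16 grid rather than being squashed to a fixed square resolution.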
Quick Start & Requirements
Install the helper package with pip install keye-vl-utils. flash_attention_2 is recommended, and vllm is required for deployment. Inference can be run with transformers or served with vLLM. Image and video inputs may be supplied as local file paths, URLs, or base64 data.
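The multimodal input format can be sketched as below. This follows the qwen-vl-utils message convention that keye-vl-utils appears to mirror; the field names and the mixing of paths, URLs, and base64 data URIs are assumptions based on that convention, not a verified Keye-VL API.

```python
# Sketch of a multimodal chat message mixing the three supported image/video
# sources (local path, URL, base64). Field names follow the qwen-vl-utils
# convention and are assumptions, not verified keye-vl-utils API.
import base64

def image_entry(source: str) -> dict:
    """Wrap an image reference; paths, URLs, and data URIs all go in "image"."""
    return {"type": "image", "image": source}

def to_data_uri(raw_bytes: bytes, mime: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data URI for inline submission."""
    return f"data:{mime};base64," + base64.b64encode(raw_bytes).decode("ascii")

messages = [
    {
        "role": "user",
        "content": [
            image_entry("file:///path/to/frame.jpg"),    # local file path
            image_entry("https://example.com/cat.png"),  # remote URL
            {"type": "video", "video": "file:///path/to/clip.mp4"},
            {"type": "text", "text": "Describe what happens in the video."},
        ],
    }
]
print(messages[0]["role"])  # -> user
```

A list like this would then be passed through the model's processor (or to a vLLM server) to build the actual prompt; the model ID and processor call are omitted here since they depend on the released checkpoint.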
Maintenance & Community
Developed by the Kwai Keye Team at Kuaishou. The project is actively updated with news and technical reports. Links to community channels are not explicitly provided in the README.
Licensing & Compatibility
The model is released under a permissive license, allowing for commercial use and integration with closed-source applications. It is based on Qwen3 and SigLIP, whose licenses should also be considered.
Limitations & Caveats
The README labels this release a "Preview," so the model and its API may change. Specific hardware requirements (e.g., GPU memory) are not detailed, though the reliance on flash_attention_2 and vLLM implies GPU deployment.