Keye by Kwai-Keye

Multimodal LLM for video and image understanding

Created 6 months ago
710 stars

Top 48.3% on SourcePulse

View on GitHub
Project Summary

Kwai Keye-VL is a multimodal large language model designed for advanced video understanding, visual perception, and reasoning tasks. It targets researchers and developers seeking state-of-the-art performance in processing complex visual and textual data, offering significant improvements over comparable models in video comprehension and logical problem-solving.

How It Works

Keye-VL builds on the Qwen3-8B language model and integrates a SigLIP vision encoder. It uses 3D RoPE to process text, images, and video in a unified way, which gives the model precise temporal perception. Images are split into sequences of 14x14 patches at native resolution, preserving aspect ratio, and the resulting visual features are projected into the language model's embedding space by an MLP. Training follows a four-stage progressive strategy for pre-training and a two-phase, five-stage approach for post-training, emphasizing Chain of Thought (CoT) reasoning and reinforcement learning for complex cognitive tasks.
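The dynamic-resolution patching can be pictured with a short sketch. This is an illustrative approximation, not the project's preprocessing code: the 14x14 patch size comes from the description above, while the token budget and rounding rules are assumptions.

    import math

    PATCH = 14           # patch size from the description above
    MAX_PATCHES = 1024   # hypothetical budget on visual tokens per image

    def patch_grid(width: int, height: int) -> tuple[int, int]:
        """Return (cols, rows) of 14x14 patches, downscaling only when over budget."""
        cols = math.ceil(width / PATCH)
        rows = math.ceil(height / PATCH)
        if cols * rows > MAX_PATCHES:
            # Shrink both axes by the same factor so the aspect ratio is preserved.
            scale = math.sqrt(MAX_PATCHES / (cols * rows))
            cols = max(1, math.floor(cols * scale))
            rows = max(1, math.floor(rows * scale))
        return cols, rows

    print(patch_grid(1280, 720))  # a 16:9 frame keeps a roughly 16:9 patch grid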

Quick Start & Requirements

  • Install: pip install keye-vl-utils
  • Prerequisites: CUDA, flash_attention_2 (recommended), vLLM (for deployment).
  • Usage: Load the model via transformers or deploy with vLLM; images and videos can be passed as local file paths, URLs, or base64 (see the sketch after this list).
  • Links: Home Page, Technical Report, Demo, Models
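A minimal usage sketch follows, assuming a Qwen2-VL-style transformers chat interface and that keye-vl-utils exposes a process_vision_info helper analogous to qwen-vl-utils. The model ID, class names, and helper are assumptions; consult the official README and model card for the exact API.

    import torch
    from transformers import AutoModel, AutoProcessor
    from keye_vl_utils import process_vision_info  # assumed helper, mirroring qwen-vl-utils

    model_id = "Kwai-Keye/Keye-VL-8B-Preview"  # illustrative model ID; see the Models link above

    # flash_attention_2 is recommended by the README; drop the argument if it is not installed.
    model = AutoModel.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
        device_map="auto",
        trust_remote_code=True,
    )
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

    # Images and videos may be given as local paths, URLs, or base64 strings.
    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/frame.jpg"},
            {"type": "text", "text": "Describe what is happening in this image."},
        ],
    }]

    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                       padding=True, return_tensors="pt").to(model.device)

    output_ids = model.generate(**inputs, max_new_tokens=256)
    answer = processor.batch_decode(output_ids[:, inputs.input_ids.shape[1]:],
                                    skip_special_tokens=True)[0]
    print(answer)

For serving rather than local generation, the README points to vLLM for deployment.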

Highlighted Details

  • State-of-the-art performance on video understanding benchmarks (Video-MME, Video-MMMU, etc.).
  • Strong capabilities in logical reasoning and mathematical problem-solving (WeMath, MathVerse).
  • Advanced training methodology including CoT, mixed-mode RL, and iterative alignment for enhanced reasoning.
  • Dynamic resolution and aspect ratio preservation for images.
  • Unified processing of text, image, and video using 3D RoPE (a toy position-index sketch follows this list).
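As rough intuition for the 3D RoPE bullet above, the toy sketch below assigns each visual token a (time, height, width) position triple in the spirit of M-RoPE-style schemes; Keye-VL's actual index assignment may differ.

    def visual_position_ids(num_frames: int, grid_h: int, grid_w: int):
        """Yield one (t, h, w) index triple per visual token, frame by frame."""
        for t in range(num_frames):
            for h in range(grid_h):
                for w in range(grid_w):
                    yield (t, h, w)

    # A 2-frame clip with a 3x4 patch grid produces 24 visual tokens; text tokens
    # typically reuse a single index on all three axes, so one rotary scheme
    # covers every modality.
    ids = list(visual_position_ids(num_frames=2, grid_h=3, grid_w=4))
    print(len(ids), ids[:3])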

Maintenance & Community

Developed by the Kwai Keye Team at Kuaishou. The project is actively updated with news and technical reports. Links to community channels are not explicitly provided in the README.

Licensing & Compatibility

The model is released under a permissive license, allowing for commercial use and integration with closed-source applications. It is based on Qwen3 and SigLIP, whose licenses should also be considered.

Limitations & Caveats

The README describes a "Preview" version, so the model and its interfaces may still change. Specific hardware requirements for good performance (e.g., GPU memory) are not documented, though the reliance on flash_attention_2 and vLLM implies a recent CUDA GPU.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 9 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Elvis Saravia (Founder of DAIR.AI).

NExT-GPT by NExT-GPT

Top 0.1% on SourcePulse
4k stars
Any-to-any multimodal LLM research paper
Created 2 years ago
Updated 8 months ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Jeff Hammerbacher (Cofounder of Cloudera), and 10 more.

x-transformers by lucidrains

Top 0.3% on SourcePulse
6k stars
Transformer library with extensive experimental features
Created 5 years ago
Updated 5 days ago