Kwai-Keye: Multimodal LLM for video and image understanding
Top 48.3% on SourcePulse
Kwai Keye-VL is a multimodal large language model designed for advanced video understanding, visual perception, and reasoning tasks. It targets researchers and developers seeking state-of-the-art performance in processing complex visual and textual data, offering significant improvements over comparable models in video comprehension and logical problem-solving.
How It Works
Keye-VL builds upon the Qwen3-8B architecture, integrating a SigLIP vision encoder. It employs 3D RoPE for unified text, image, and video processing, enabling precise temporal perception. Images are handled via a 14x14 patch sequence with dynamic resolution and aspect ratio preservation, mapped by an MLP. The model's training involves a four-stage progressive strategy for pre-training and a two-phase, five-stage approach for post-training, emphasizing Chain of Thought (CoT) reasoning and reinforcement learning for complex cognitive tasks.
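The 14x14 patching with dynamic resolution can be illustrated with a small sketch. This is not the official Keye-VL implementation; it assumes Qwen2-VL-style behavior where each image side is rounded to a whole number of patches while the aspect ratio is preserved, and the resulting patch embeddings are mapped into the language model by an MLP.

```python
# Illustrative sketch of 14x14 patch tokenization with dynamic resolution.
# NOT the official Keye-VL code; the rounding rule is an assumption.

PATCH = 14  # ViT patch edge used by the SigLIP-style encoder described above

def patch_grid(width: int, height: int, patch: int = PATCH) -> tuple[int, int]:
    """Round each side to the nearest whole number of patches (at least one),
    so the aspect ratio is approximately preserved at any input resolution."""
    grid_w = max(1, round(width / patch))
    grid_h = max(1, round(height / patch))
    return grid_w, grid_h

def num_patches(width: int, height: int) -> int:
    """Total patch embeddings produced for one image."""
    grid_w, grid_h = patch_grid(width, height)
    return grid_w * grid_h

# A 224x224 image yields a 16x16 grid -> 256 patch embeddings.
print(num_patches(224, 224))  # -> 256
```

Because the grid adapts to the input size, a wide 336x224 image produces a 24x16 grid rather than being squashed to a fixed square resolution.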
Quick Start & Requirements
Install the helper package with pip install keye-vl-utils. flash_attention_2 is recommended, and vllm is required for deployment. Inference can be run with transformers or served with vLLM. Image and video inputs may be supplied as local file paths, URLs, or base64 data.
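The multimodal input format can be sketched as below. This follows the qwen-vl-utils message convention that keye-vl-utils appears to mirror; the field names and the mixing of paths, URLs, and base64 data URIs are assumptions based on that convention, not a verified Keye-VL API.

```python
# Sketch of a multimodal chat message mixing the three supported image/video
# sources (local path, URL, base64). Field names follow the qwen-vl-utils
# convention and are assumptions, not verified keye-vl-utils API.
import base64

def image_entry(source: str) -> dict:
    """Wrap an image reference; paths, URLs, and data URIs all go in "image"."""
    return {"type": "image", "image": source}

def to_data_uri(raw_bytes: bytes, mime: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data URI for inline submission."""
    return f"data:{mime};base64," + base64.b64encode(raw_bytes).decode("ascii")

messages = [
    {
        "role": "user",
        "content": [
            image_entry("file:///path/to/frame.jpg"),    # local file path
            image_entry("https://example.com/cat.png"),  # remote URL
            {"type": "video", "video": "file:///path/to/clip.mp4"},
            {"type": "text", "text": "Describe what happens in the video."},
        ],
    }
]
print(messages[0]["role"])  # -> user
```

A list like this would then be passed through the model's processor (or to a vLLM server) to build the actual prompt; the model ID and processor call are omitted here since they depend on the released checkpoint.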
Maintenance & Community
Developed by the Kwai Keye Team at Kuaishou. The project is actively updated with news and technical reports. Links to community channels are not explicitly provided in the README.
Licensing & Compatibility
The model is released under a permissive license, allowing for commercial use and integration with closed-source applications. It is based on Qwen3 and SigLIP, whose licenses should also be considered.
Limitations & Caveats
The README labels this release a "Preview," so the model and its API may change. Specific hardware requirements (e.g., GPU memory) are not detailed, though the reliance on flash_attention_2 and vLLM implies GPU deployment.