EvolvingLMMs-Lab: Efficient multimodal model training framework
LLaVA-OneVision-1.5 is a fully open-source framework for Large Multimodal Models (LMMs) that achieves state-of-the-art performance at reduced training cost. It targets researchers and developers who need efficient, high-quality LMMs, and it outperforms models such as Qwen2.5-VL through optimized training on native-resolution images and carefully curated datasets.
How It Works
This project introduces LMMs trained on native-resolution images, leveraging a meticulously curated 64B-token dataset for superior data efficiency. The framework builds on Megatron-LM, supporting Mixture-of-Experts (MoE), FP8, and long-sequence parallelism. Its optimized codebase enables cost-effective scaling, efficient multimodal parallelism, and data load balancing.
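For orientation, the sketch below illustrates the kinds of parallelism and precision knobs a Megatron-LM-based training setup like this typically exposes. The field names are illustrative assumptions, not the project's actual configuration API; consult the repository's training scripts for the real flags.

```python
# Illustrative-only sketch of the training knobs described above (MoE, FP8,
# long-sequence parallelism). Names are hypothetical, not the project's API.
from dataclasses import dataclass


@dataclass
class TrainConfig:
    tensor_model_parallel_size: int = 4    # split attention/MLP weights across GPUs
    pipeline_model_parallel_size: int = 2  # split layers into pipeline stages
    context_parallel_size: int = 2         # long-sequence (context) parallelism
    num_experts: int = 8                   # Mixture-of-Experts width
    moe_router_topk: int = 2               # experts activated per token
    fp8: bool = True                       # FP8 mixed-precision matmuls
    seq_length: int = 32768                # long multimodal sequences
    micro_batch_size: int = 1
    global_batch_size: int = 512


cfg = TrainConfig()
print(cfg)
```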
Quick Start & Requirements
Inference via Hugging Face transformers is provided; accelerate, flash_attention_2, and specific datasets (e.g., LLaVA-558K-Webdataset and the LLaVA-NeXT-780K dataset) are required. Model conversion scripts are included.
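A minimal inference sketch with transformers is shown below. The model ID, chat format, and generation settings are assumptions to verify against the model card; the README only states that transformers-based inference is supported.

```python
# Minimal inference sketch; model id and chat template are assumed, check the Hub model card.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "lmms-lab/LLaVA-OneVision-1.5-8B-Instruct"  # hypothetical id; verify on the Hub

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires the flash-attn package
    device_map="auto",
    trust_remote_code=True,
)

image = Image.open("example.jpg")
messages = [
    {"role": "user",
     "content": [{"type": "image"},
                 {"type": "text", "text": "Describe this image."}]},
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```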
Maintenance & Community
The project acknowledges the LLaVA community and Baidu AI Cloud's AIAK team, and draws inspiration from LLaVA, LLaVA-NeXT, lmms-eval, Megatron-LM, Qwen2.5-VL, Qwen3, and MetaCLIP. The roadmap includes ultra-efficient MoE training and full video-input LLMs for Q4 2025.
Licensing & Compatibility
The provided README does not specify a software license, potentially impacting commercial use or integration into closed-source projects.
Limitations & Caveats
Mid-training and instruction-tuning datasets are still being uploaded. More detailed reproduction steps will be released once the datasets are available, so reproduction is currently incomplete or may require additional data acquisition.