LLaVA-OneVision-1.5 by EvolvingLMMs-Lab

Efficient multimodal model training framework

Created 1 month ago
596 stars

Top 54.6% on SourcePulse

View on GitHub
Project Summary

LLaVA-OneVision-1.5 offers a fully open-source framework for Large Multimodal Models (LMMs), achieving state-of-the-art performance at reduced training cost. It targets researchers and developers who need efficient, high-quality LMMs, and reports results that outperform models such as Qwen2.5-VL through optimized training on native-resolution images and curated datasets.

How It Works

This project introduces LMMs trained on native-resolution images, leveraging a meticulously curated 64B-token dataset for superior data efficiency. The framework builds on Megatron-LM, supporting Mixture-of-Experts (MoE), FP8, and long-sequence parallelization. Its optimized codebase enables cost-effective scaling, efficient multimodal parallelism, and data load balancing.
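Native-resolution training means the vision encoder sees a variable patch grid per image rather than a fixed square crop. The sketch below is purely illustrative of that bookkeeping and is not the project's code: the patch size, token budget, and function name are all assumptions.

```python
# Illustrative sketch of native-resolution patching (assumptions: 14 px
# ViT patches, one token per patch, a 4096 vision-token budget; this is
# not LLaVA-OneVision-1.5 code).
from PIL import Image

PATCH = 14  # assumed patch edge in pixels

def native_resolution_grid(image: Image.Image, max_tokens: int = 4096):
    """Snap an image to a whole patch grid, capping the vision-token count."""
    w, h = image.size
    cols, rows = max(w // PATCH, 1), max(h // PATCH, 1)
    if cols * rows > max_tokens:
        # Scale both sides uniformly so the grid fits the token budget
        # while roughly preserving the original aspect ratio.
        scale = (max_tokens / (cols * rows)) ** 0.5
        cols, rows = max(int(cols * scale), 1), max(int(rows * scale), 1)
    resized = image.resize((cols * PATCH, rows * PATCH))
    return resized, cols * rows  # image and its vision-token count
```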

Quick Start & Requirements

  • Primary Install/Run: Docker is the recommended setup path: clone the repository, build the image, and run it with GPU access. A Python inference example using Hugging Face transformers is also provided (a hedged sketch follows this list).
  • Prerequisites: Docker, NVIDIA GPUs (A100 80GB is cited for the Docker example), Python, accelerate, flash_attention_2, and specific datasets (e.g., LLaVA-558K-Webdataset, LLaVA-NeXT-780K Dataset). Model conversion scripts are included.
  • Links: LLaVA-OneVision-1.5 GitHub, lmms-eval.
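The README's own inference example is not reproduced here; the following is a minimal sketch of transformers-based inference under stated assumptions: the model ID is a hypothetical placeholder, and the processor chat-template usage assumes a recent transformers release. Consult the actual model card for the exact API.

```python
# Minimal inference sketch (assumptions: hypothetical model ID, recent
# transformers with processor chat templates; see the real model card).
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_ID = "lmms-lab/LLaVA-OneVision-1.5-8B-Instruct"  # hypothetical ID

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # listed as a prerequisite
    device_map="auto",                        # requires accelerate
    trust_remote_code=True,
)

image = Image.open("example.jpg")
messages = [
    {"role": "user",
     "content": [{"type": "image"},
                 {"type": "text", "text": "Describe this image."}]},
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(out[0], skip_special_tokens=True))
```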

Highlighted Details

  • Achieves state-of-the-art performance, outperforming Qwen2.5-VL across benchmarks.
  • Enables cost-effective training ($16,000 budget on A100 GPUs).
  • Utilizes high-quality, curated data (64B tokens) for enhanced efficiency.
  • Built on Megatron-LM with MoE, FP8, and long-sequence parallelization for scalable training.

Maintenance & Community

Acknowledges the LLaVA community and Baidu AI Cloud's AIAK team. Draws inspiration from LLaVA, LLaVA-NeXT, lmms-eval, Megatron-LM, Qwen2.5-VL, Qwen3, and MetaCLIP. The roadmap targets ultra-efficient MoE training and full video-input LLMs for Q4 2025.

Licensing & Compatibility

The README does not specify a software license, which may impact commercial use or integration into closed-source projects.

Limitations & Caveats

The mid-training and instruction-tuning datasets are marked as still uploading. More detailed reproduction steps are promised once the upload completes, so reproduction is currently incomplete or may require acquiring data separately.

Health Check

  • Last Commit: 2 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 7
  • Issues (30d): 20
  • Star History: 166 stars in the last 30 days

Explore Similar Projects

fms-fsdp by foundation-model-stack
  Efficiently train foundation models with PyTorch
  270 stars · Created 1 year ago · Updated 3 months ago
  Starred by Wing Lian (Founder of Axolotl AI) and Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake).

ColossalAI by hpcaitech
  AI system for large-scale parallel training
  41k stars · Created 4 years ago · Updated 3 weeks ago
  Starred by Tobi Lutke (Cofounder of Shopify), Li Jiang (Coauthor of AutoGen; Engineer at Microsoft), and 27 more.