LLaVA-OneVision-1.5 by EvolvingLMMs-Lab

Efficient multimodal model training framework

Created 3 months ago
681 stars

Top 49.9% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

LLaVA-OneVision-1.5 offers a fully open-source framework for Large Multimodal Models (LMMs), achieving state-of-the-art performance at reduced training cost. It targets researchers and developers who need efficient, high-quality LMMs, and it outperforms models such as Qwen2.5-VL through optimized training on native-resolution images and curated datasets.

How It Works

This project introduces LMMs trained on native-resolution images, leveraging a meticulously curated 64B-token dataset for superior data efficiency. The framework builds on Megatron-LM, supporting Mixture-of-Experts (MoE), FP8 training, and long-sequence parallelism. Its optimized codebase enables cost-effective scaling, efficient multimodal parallelism, and data load balancing.
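For a concrete sense of what those capabilities mean at the Megatron-LM level, the sketch below lists the style of parallelism and precision flags involved. It is illustrative only, not the project's actual launch script: the flag values are placeholders, though the flags themselves are standard Megatron-LM arguments for tensor/pipeline/context parallelism, MoE, and FP8.

```python
# Illustrative Megatron-LM-style parallelism/precision flags (placeholder values);
# the project's real launch scripts live in its repository.
megatron_args = [
    "--tensor-model-parallel-size", "4",    # shard each layer's weights across 4 GPUs
    "--pipeline-model-parallel-size", "2",  # split the layer stack into 2 pipeline stages
    "--context-parallel-size", "2",         # long-sequence parallelism over the token dimension
    "--sequence-parallel",                  # shard LayerNorm/dropout activations along the sequence
    "--num-experts", "8",                   # Mixture-of-Experts feed-forward layers
    "--expert-model-parallel-size", "2",    # distribute experts across GPUs
    "--fp8-format", "hybrid",               # FP8 compute (E4M3 forward, E5M2 backward)
    "--bf16",                               # BF16 precision outside the FP8 regions
]
print(" ".join(megatron_args))
```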

Quick Start & Requirements

  • Primary Install/Run: Docker is recommended for setup, involving cloning the repository, building an image, and running it with GPU access. A Python inference example using Hugging Face transformers is provided (see the sketch after this list).
  • Prerequisites: Docker, NVIDIA GPUs (A100 80GB mentioned for Docker example), Python, accelerate, flash_attention_2, and specific datasets (e.g., LLaVA-558K-Webdataset, LLaVA-NeXT-780K Dataset). Model conversion scripts are included.
  • Links: LLaVA-OneVision-1.5 GitHub, lmms-eval.
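As referenced above, a minimal Python inference sketch is shown below. It assumes the published checkpoint follows the standard Hugging Face AutoProcessor/AutoModelForCausalLM interface with remote code enabled; the model id, prompt format, and image path are placeholders, so consult the repository's README for the exact usage.

```python
# Minimal inference sketch (assumptions: model id, chat-template prompt format,
# and that the checkpoint uses the standard AutoProcessor/AutoModel remote-code interface).
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "lmms-lab/LLaVA-OneVision-1.5-8B-Instruct"  # hypothetical checkpoint name

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # flash_attention_2 is listed as a prerequisite
    device_map="auto",
    trust_remote_code=True,
)

# Build a single-image chat prompt and run generation.
image = Image.open("example.jpg")
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."},
    ]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(
    model.device, dtype=torch.bfloat16
)

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```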

Highlighted Details

  • Achieves state-of-the-art performance, outperforming Qwen2.5-VL across benchmarks.
  • Enables cost-effective training ($16,000 budget on A100 GPUs).
  • Utilizes high-quality, curated data (64B tokens) for enhanced efficiency.
  • Built on Megatron-LM with MoE, FP8, and long-sequence parallelism for scalable training.

Maintenance & Community

Acknowledges the LLaVA community and Baidu AI Cloud's AIAK team. The project draws inspiration from LLaVA, LLaVA-NeXT, lmms-eval, Megatron-LM, Qwen2.5-VL, Qwen3, and MetaCLIP. The roadmap targets ultra-efficient MoE training and full video-input LLMs for Q4 2025.

Licensing & Compatibility

The provided README does not specify a software license, potentially impacting commercial use or integration into closed-source projects.

Limitations & Caveats

Mid-training and instruction-tuning datasets are still being uploaded. More detailed reproduction steps will be released once the dataset uploads are complete, so reproduction may currently be incomplete or require additional data acquisition.

Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 4
  • Issues (30d): 12
  • Star History: 28 stars in the last 30 days

Explore Similar Projects

Starred by Wing Lian (Founder of Axolotl AI) and Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake).

fms-fsdp by foundation-model-stack

Efficiently train foundation models with PyTorch
Top 0.7% on SourcePulse · 278 stars · Created 1 year ago · Updated 1 month ago