LLaVA-OneVision-1.5 by EvolvingLMMs-Lab

Efficient multimodal model training framework

Created 5 months ago
751 stars

Top 46.3% on SourcePulse

Project Summary

LLaVA-OneVision-1.5 offers a fully open-source framework for Large Multimodal Models (LMMs), achieving state-of-the-art performance at reduced training cost. It targets researchers and developers seeking efficient, high-quality LMMs, and outperforms models such as Qwen2.5-VL through optimized training on native-resolution images and curated datasets.

How It Works

This project introduces LMMs trained on native-resolution images, leveraging a meticulously curated 64B-token dataset for superior data efficiency. The framework builds on Megatron-LM, supporting Mixture-of-Experts (MoE), FP8, and long-sequence parallelization. Its optimized codebase enables cost-effective scaling, efficient multimodal parallelism, and data load balancing.

Quick Start & Requirements

  • Primary Install/Run: Docker is recommended for setup, involving cloning, building an image, and running with GPU access. A Python inference example using Hugging Face transformers is provided.
  • Prerequisites: Docker, NVIDIA GPUs (the Docker example assumes A100 80GB), Python, accelerate, flash_attention_2, and specific datasets (e.g., LLaVA-558K-Webdataset, LLaVA-NeXT-780K Dataset). Model conversion scripts are included.
  • Links: LLaVA-OneVision-1.5 GitHub, lmms-eval.
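The Python inference route mentioned above can be sketched as follows. This is a minimal sketch, assuming the checkpoint is published on the Hugging Face Hub and follows the standard transformers multimodal chat API; the model-loading helper, function names, and any model ID you pass in are illustrative assumptions, not the project's confirmed example.

```python
def build_chat_messages(prompt: str, image_url: str) -> list:
    """Assemble the chat-template message list that Hugging Face
    multimodal processors expect (one user turn: image + text)."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": image_url},
                {"type": "text", "text": prompt},
            ],
        }
    ]


def run_inference(model_id: str, prompt: str, image_url: str) -> str:
    """Hypothetical end-to-end call: load the model, apply the chat
    template, and generate. Requires a GPU plus the accelerate /
    flash_attention_2 prerequisites noted above."""
    import torch
    from transformers import AutoModelForImageTextToText, AutoProcessor

    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForImageTextToText.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )
    inputs = processor.apply_chat_template(
        build_chat_messages(prompt, image_url),
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)
    output = model.generate(**inputs, max_new_tokens=128)
    return processor.decode(output[0], skip_special_tokens=True)


# Message construction alone needs no GPU and can be inspected directly.
messages = build_chat_messages("Describe this image.", "https://example.com/cat.jpg")
```

Consult the repository's README for the exact model ID and any project-specific preprocessing before relying on this pattern.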

Highlighted Details

  • Achieves state-of-the-art performance, outperforming Qwen2.5-VL across benchmarks.
  • Enables cost-effective training ($16,000 budget on A100 GPUs).
  • Utilizes high-quality, curated data (64B tokens) for enhanced efficiency.
  • Built on MegatronLM with MoE, FP8, and long sequence parallelization for scalable training.

Maintenance & Community

The project acknowledges the LLaVA community and Baidu AI Cloud's AIAK team, and draws inspiration from LLaVA, LLaVA-NeXT, lmms-eval, Megatron-LM, Qwen2.5-VL, Qwen3, and MetaCLIP. The roadmap includes ultra-efficient MoE training and full video-input LLMs for Q4 2025.

Licensing & Compatibility

The provided README does not specify a software license, potentially impacting commercial use or integration into closed-source projects.

Limitations & Caveats

Mid-training and instruction-tuning datasets are still being uploaded. More detailed reproduction steps will be released after the upload completes, so current reproduction may be incomplete or require further data acquisition.

Health Check

  • Last Commit: 2 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 5
  • Star History: 52 stars in the last 30 days
