LLaVA-OneVision-1.5 by EvolvingLMMs-Lab

Efficient multimodal model training framework

Created 5 months ago
751 stars

Top 46.3% on SourcePulse

Project Summary

LLaVA-OneVision-1.5 offers a fully open-source framework for Large Multimodal Models (LMMs), achieving state-of-the-art performance at reduced training cost. It targets researchers and developers seeking efficient, high-quality LMMs, and outperforms models such as Qwen2.5-VL through optimized training on native-resolution images and curated datasets.

How It Works

This project introduces LMMs trained on native-resolution images, leveraging a meticulously curated 64B-token dataset for superior data efficiency. The framework builds on Megatron-LM, supporting Mixture-of-Experts (MoE), FP8, and long-sequence parallelization. Its optimized codebase enables cost-effective scaling, efficient multimodal parallelism, and data load balancing.

Quick Start & Requirements

  • Primary Install/Run: Docker is recommended for setup, involving cloning, building an image, and running with GPU access. A Python inference example using Hugging Face transformers is provided.
  • Prerequisites: Docker, NVIDIA GPUs (the Docker example assumes A100 80GB), Python, accelerate, flash_attention_2, and specific datasets (e.g., LLaVA-558K-Webdataset, LLaVA-NeXT-780K Dataset). Model conversion scripts are included.
  • Links: LLaVA-OneVision-1.5 GitHub, lmms-eval.
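The Python inference route mentioned above can be sketched as follows. This is a minimal sketch, assuming the checkpoint is published on the Hugging Face Hub and follows the standard transformers multimodal chat API; the model-loading helper, function names, and any model ID you pass in are illustrative assumptions, not the project's confirmed example.

```python
def build_chat_messages(prompt: str, image_url: str) -> list:
    """Assemble the chat-template message list that Hugging Face
    multimodal processors expect (one user turn: image + text)."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": image_url},
                {"type": "text", "text": prompt},
            ],
        }
    ]


def run_inference(model_id: str, prompt: str, image_url: str) -> str:
    """Hypothetical end-to-end call: load the model, apply the chat
    template, and generate. Requires a GPU plus the accelerate /
    flash_attention_2 prerequisites noted above."""
    import torch
    from transformers import AutoModelForImageTextToText, AutoProcessor

    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForImageTextToText.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )
    inputs = processor.apply_chat_template(
        build_chat_messages(prompt, image_url),
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)
    output = model.generate(**inputs, max_new_tokens=128)
    return processor.decode(output[0], skip_special_tokens=True)


# Message construction alone needs no GPU and can be inspected directly.
messages = build_chat_messages("Describe this image.", "https://example.com/cat.jpg")
```

Consult the repository's README for the exact model ID and any project-specific preprocessing before relying on this pattern.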

Highlighted Details

  • Achieves state-of-the-art performance, outperforming Qwen2.5-VL across benchmarks.
  • Enables cost-effective training ($16,000 budget on A100 GPUs).
  • Utilizes high-quality, curated data (64B tokens) for enhanced efficiency.
  • Built on MegatronLM with MoE, FP8, and long sequence parallelization for scalable training.

Maintenance & Community

The project acknowledges the LLaVA community and Baidu AI Cloud's AIAK team, and draws inspiration from LLaVA, LLaVA-NeXT, lmms-eval, Megatron-LM, Qwen2.5-VL, Qwen3, and MetaCLIP. The roadmap includes ultra-efficient MoE training and full video-input LLMs for Q4 2025.

Licensing & Compatibility

The provided README does not specify a software license, potentially impacting commercial use or integration into closed-source projects.

Limitations & Caveats

Mid-training and instruction-tuning datasets are still being uploaded. More detailed reproduction steps will be released after the upload completes, so current reproduction may be incomplete or require further data acquisition.

Health Check

  • Last Commit: 2 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 5
  • Star History: 52 stars in the last 30 days
