Unified multimodal model for perception and generation
Top 68.1% on SourcePulse
Ming-lite-omni v1.5 is a 20.3 billion parameter multimodal large language model designed for advanced understanding and generation across text, image, video, and audio. It targets researchers and developers seeking a unified model for complex tasks like visual question answering, image editing, and speech processing, offering competitive performance on various benchmarks.
How It Works
Built upon the Ling LLM, Ming-lite-omni v1.5 utilizes a Mixture-of-Experts (MoE) architecture with 3 billion active parameters. This approach allows for efficient scaling and specialization across different modalities, enhancing performance in tasks such as image-text understanding, document analysis, video comprehension, and speech synthesis/recognition. The model emphasizes precise control in image generation and editing, maintaining consistency in scenes and identities.
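As a rough illustration of how a sparse MoE keeps only about 3 billion of the 20.3 billion parameters active per token, here is a minimal, generic top-k routing sketch in PyTorch; the expert count, hidden sizes, and routing scheme are illustrative assumptions, not Ming-lite-omni's actual implementation.

import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    # Generic top-k expert routing: each token is dispatched to a small
    # subset of experts, so per-token compute tracks the "active" parameters
    # rather than the full parameter count.
    def __init__(self, d_model=2048, d_ff=4096, num_experts=16, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                              # x: (tokens, d_model)
        scores = self.router(x)                        # (tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):  # only the selected experts run
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

y = TopKMoE()(torch.randn(4, 2048))                    # smoke test: output shape (4, 2048)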
Quick Start & Requirements
Installation: pip install -r requirements.txt (Python 3.10+ recommended). Specific dependencies like nvidia-cublas-cu12 are required for NVIDIA GPUs. Docker support is available. Running the model in bfloat16 requires approximately 42GB of GPU memory.
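As a starting point for loading the released weights, a minimal sketch is below; the Hugging Face model id and the AutoModel/AutoProcessor entry points are assumptions, so prefer the exact snippet from the official model card if it differs.

import torch
from transformers import AutoModel, AutoProcessor

model_id = "inclusionAI/Ming-Lite-Omni-1.5"  # assumed id; verify against the repo's Hugging Face link

# bfloat16 keeps memory near the ~42GB figure cited above; device_map="auto"
# spreads the weights across the available GPUs.
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)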
Highlighted Details
Maintenance & Community
The project has seen active development with releases of v1.0, a preview version, and the latest v1.5. Links to Hugging Face and ModelScope are provided for model access.
Licensing & Compatibility
The code repository is licensed under the MIT License. A separate legal disclaimer is provided.
Limitations & Caveats
While v1.5 shows improvements overall, it only matches or slightly trails comparable models on some benchmarks (e.g., MMBench, MMMU, MVBench). Specific hardware configurations (e.g., H20/H800 GPUs) and CUDA versions are noted for optimal performance and deployment.