Ming by inclusionAI

Unified multimodal model for perception and generation

created 3 months ago
437 stars

Top 68.1% on SourcePulse

View on GitHub
Project Summary

Ming-lite-omni v1.5 is a 20.3 billion parameter multimodal large language model designed for advanced understanding and generation across text, image, video, and audio. It targets researchers and developers seeking a unified model for complex tasks like visual question answering, image editing, and speech processing, offering competitive performance on various benchmarks.

How It Works

Built upon the Ling LLM, Ming-lite-omni v1.5 utilizes a Mixture-of-Experts (MoE) architecture with 3 billion active parameters. This approach allows for efficient scaling and specialization across different modalities, enhancing performance in tasks such as image-text understanding, document analysis, video comprehension, and speech synthesis/recognition. The model emphasizes precise control in image generation and editing, maintaining consistency in scenes and identities.
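
As a rough illustration of the MoE idea only (not Ming-lite-omni's actual implementation), the sketch below routes each token to a small top-k subset of expert feed-forward networks, so only a fraction of the total parameter count is active for any given token; the expert count, layer sizes, and k are placeholders.

    # Illustrative top-k MoE routing in PyTorch; all sizes are arbitrary
    # placeholders and do not reflect Ming-lite-omni's real configuration.
    import torch
    import torch.nn as nn

    class TinyMoE(nn.Module):
        def __init__(self, d_model=64, n_experts=8, k=2):
            super().__init__()
            self.router = nn.Linear(d_model, n_experts)  # scores each expert per token
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, 4 * d_model),
                              nn.GELU(),
                              nn.Linear(4 * d_model, d_model))
                for _ in range(n_experts)
            )
            self.k = k

        def forward(self, x):                        # x: (tokens, d_model)
            probs = self.router(x).softmax(dim=-1)   # routing probabilities
            topw, topi = probs.topk(self.k, dim=-1)  # keep k experts per token
            out = torch.zeros_like(x)
            for slot in range(self.k):
                for e, expert in enumerate(self.experts):
                    mask = topi[:, slot] == e        # tokens routed to expert e
                    if mask.any():
                        out[mask] += topw[mask, slot, None] * expert(x[mask])
            return out

    y = TinyMoE()(torch.randn(5, 64))  # only 2 of 8 experts run per token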

Quick Start & Requirements

  • Installation: pip install -r requirements.txt (Python 3.10+ recommended). Specific dependencies like nvidia-cublas-cu12 are required for NVIDIA GPUs. Docker support is available.
  • Prerequisites: NVIDIA GPU with CUDA 12.1+ is recommended. Loading the model in bfloat16 requires approximately 42GB of GPU memory.
  • Resources: Model weights can be downloaded from Hugging Face or ModelScope; a hedged loading sketch follows this list.
  • Documentation: Project page: https://lucaria-academy.github.io/Ming-Omni/, Technical Report: https://arxiv.org/abs/2506.09344.
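
Below is a minimal loading sketch under stated assumptions: that the weights are published on Hugging Face under a repo id such as inclusionAI/Ming-Lite-Omni (unverified here) and expose a trust_remote_code entry point through transformers; consult the repository's own instructions for the exact classes, processors, and generation calls.

    # Hedged sketch: download the checkpoint and load it in bfloat16
    # (~42 GB of GPU memory per the requirements above).
    import torch
    from transformers import AutoModel

    MODEL_ID = "inclusionAI/Ming-Lite-Omni"  # assumed Hugging Face repo id

    model = AutoModel.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.bfloat16,   # bfloat16 load needs roughly 42 GB
        trust_remote_code=True,       # model ships custom modeling code
    ).to("cuda")
    model.eval()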

Highlighted Details

  • Achieves state-of-the-art results on OCRBench and ChartQA among models under 10B parameters.
  • Demonstrates leading performance in video understanding benchmarks like VideoMME and LongVideoBench for its size class.
  • Offers strong capabilities in multilingual Automatic Speech Recognition (ASR) and Audio Question Answering (QA), supporting multiple dialects.
  • Features enhanced image generation with improved control over scene/person ID consistency and expanded support for perception tasks.

Maintenance & Community

The project has seen active development with releases of v1.0, a preview version, and the latest v1.5. Links to Hugging Face and ModelScope are provided for model access.

Licensing & Compatibility

The code repository is licensed under the MIT License. A separate legal disclaimer is provided.

Limitations & Caveats

While v1.5 shows improvements, some benchmarks indicate performance parity or slight regressions compared to other models (e.g., MMBench, MMMU, MVBench). Specific hardware configurations (e.g., H20/H800 GPUs) and CUDA versions are noted for optimal performance and deployment.

Health Check

  • Last commit: 4 days ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 9
  • Issues (30d): 4
  • Star History: 54 stars in the last 30 days

Explore Similar Projects

Starred by Alex Yu (Research Scientist at OpenAI; former cofounder of Luma AI), Lianmin Zheng (author of SGLang), and 2 more.

HunyuanVideo by Tencent-Hunyuan (0.3%, 11k stars)
PyTorch code for video generation research
created 8 months ago, updated 1 week ago