Unified multimodal model for perception and generation
Top 68.1% on SourcePulse
Ming-lite-omni v1.5 is a 20.3 billion parameter multimodal large language model designed for advanced understanding and generation across text, image, video, and audio. It targets researchers and developers seeking a unified model for complex tasks like visual question answering, image editing, and speech processing, offering competitive performance on various benchmarks.
How It Works
Built upon the Ling LLM, Ming-lite-omni v1.5 utilizes a Mixture-of-Experts (MoE) architecture with 3 billion active parameters. This approach allows for efficient scaling and specialization across different modalities, enhancing performance in tasks such as image-text understanding, document analysis, video comprehension, and speech synthesis/recognition. The model emphasizes precise control in image generation and editing, maintaining consistency in scenes and identities.
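As a rough illustration of how a sparse MoE keeps only about 3 billion of the 20.3 billion parameters active per token, here is a minimal, generic top-k routing sketch in PyTorch; the expert count, hidden sizes, and routing scheme are illustrative assumptions, not Ming-lite-omni's actual implementation.

import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    # Generic top-k expert routing: each token is dispatched to a small
    # subset of experts, so per-token compute tracks the "active" parameters
    # rather than the full parameter count.
    def __init__(self, d_model=2048, d_ff=4096, num_experts=16, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                              # x: (tokens, d_model)
        scores = self.router(x)                        # (tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):  # only the selected experts run
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

y = TopKMoE()(torch.randn(4, 2048))                    # smoke test: output shape (4, 2048)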
Quick Start & Requirements
Installation: pip install -r requirements.txt (Python 3.10+ recommended). Specific dependencies like nvidia-cublas-cu12 are required for NVIDIA GPUs. Docker support is available. Running the model in bfloat16 requires approximately 42GB of GPU memory.
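As a starting point for loading the released weights, a minimal sketch is below; the Hugging Face model id and the AutoModel/AutoProcessor entry points are assumptions, so prefer the exact snippet from the official model card if it differs.

import torch
from transformers import AutoModel, AutoProcessor

model_id = "inclusionAI/Ming-Lite-Omni-1.5"  # assumed id; verify against the repo's Hugging Face link

# bfloat16 keeps memory near the ~42GB figure cited above; device_map="auto"
# spreads the weights across the available GPUs.
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)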
Highlighted Details
Maintenance & Community
The project has seen active development with releases of v1.0, a preview version, and the latest v1.5. Links to Hugging Face and ModelScope are provided for model access.
Licensing & Compatibility
The code repository is licensed under the MIT License. A separate legal disclaimer is provided.
Limitations & Caveats
While v1.5 shows improvements overall, it only matches or slightly trails comparable models on some benchmarks (e.g., MMBench, MMMU, MVBench). Specific hardware configurations (e.g., H20/H800 GPUs) and CUDA versions are noted for optimal performance and deployment.