AI model for edge-side intelligence, optimized for speed
Top 85.0% on sourcepulse
Megrez-3B-Omni is a 3B-parameter multimodal LLM designed for efficient on-device inference, with text, image, and audio understanding. It aims for state-of-the-art image comprehension while retaining strong language capabilities and adding audio processing, making it suitable for developers who need a compact yet capable AI solution.
How It Works
Megrez-3B-Omni extends the Megrez-3B-Instruct LLM with image understanding via a SigLIP-400M vision encoder (supplying image tokens) and audio processing via the encoder from Qwen2-Audio/Whisper-large-v3. This multimodal architecture allows unified processing of text, images, and audio. The model keeps a LLaMA-like structure for ease of deployment and relies on software-hardware collaborative optimization for high inference speed, and it is claimed to outperform larger models on various benchmarks.
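As a rough illustration of how such a model is typically driven through the transformers trust_remote_code path, the sketch below loads the checkpoint with flash_attention_2 (as recommended under Quick Start & Requirements) and sends a mixed text-image turn. The repository id, message schema, and chat() helper are assumptions, not confirmed API; check the official model card for the exact interface.

```python
import torch
from transformers import AutoModelForCausalLM

# Hypothetical repository id; verify against the official model card.
MODEL_ID = "Infinigence/Megrez-3B-Omni"

# Load on a CUDA GPU with flash_attention_2, as recommended below.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,          # multimodal chat helpers ship as remote code
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).eval().cuda()

# Assumed chat-style interface: one user turn mixing text and an image.
# The exact message schema and method name may differ in the released code.
messages = [
    {
        "role": "user",
        "content": {
            "text": "Describe this picture.",
            "image": "./example_image.jpg",
        },
    }
]

response = model.chat(messages, max_new_tokens=256, sampling=True)
print(response)
```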
Quick Start & Requirements
Requires the transformers library, with flash_attention_2 recommended. A GPU with CUDA is required for optimal performance.
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The model card acknowledges the hallucination issues inherent to LLMs. For OCR tasks, disabling sampling (sampling=False) is recommended to mitigate hallucinations, though this may lead to repetition; a short sketch follows below. Image inputs are best provided in the first conversation turn for optimal results.
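For the OCR case, the same assumed chat() helper from the sketch above would be called with sampling disabled; the argument names remain illustrative.

```python
# OCR-style query: disable sampling (greedy decoding) to reduce hallucinations,
# at the possible cost of repetitive output.
messages = [
    {
        "role": "user",
        "content": {
            "text": "Transcribe all text in this image.",
            "image": "./receipt.jpg",   # image supplied in the first turn
        },
    }
]
response = model.chat(messages, max_new_tokens=512, sampling=False)
print(response)
```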