Infini-Megrez by infinigence

AI model for edge-side intelligence, optimized for speed

created 10 months ago
325 stars

Top 85.0% on sourcepulse

Project Summary

Megrez-3B-Omni is a 3B-parameter multimodal LLM designed for efficient on-device inference, with text, image, and audio understanding. It aims to deliver state-of-the-art image comprehension while retaining strong language capabilities and adding audio processing, making it suitable for developers seeking a compact yet powerful AI solution.

How It Works

Megrez-3B-Omni extends the Megrez-3B-Instruct LLM with image understanding via SigLip-400M image tokens and audio understanding via the Qwen2-Audio/whisper-large-v3 encoder, so text, images, and audio are processed in a single unified model. It keeps a LLaMA-like structure for ease of deployment and relies on software-hardware collaborative optimization for high inference speed, claiming to outperform larger models on various benchmarks.
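As a rough illustration of that fusion design, the sketch below uses hypothetical projector modules and a stand-in decoder (the real layer names, dimensions, and encoders live in the Infini-Megrez-Omni code): each modality is encoded into token embeddings, projected into the LLM's hidden size, and concatenated with the text tokens before decoding.

```python
import torch
import torch.nn as nn

class OmniFusionSketch(nn.Module):
    """Hypothetical sketch of a Megrez-style omni model: modality features are
    projected into the LLM embedding space, then a single decoder consumes the
    concatenated token sequence."""

    def __init__(self, hidden_size=2048, image_feat_dim=1152, audio_feat_dim=1280):
        super().__init__()
        # Stand-ins for the SigLip-400M and Qwen2-Audio/whisper-large-v3 encoders' outputs.
        self.image_proj = nn.Linear(image_feat_dim, hidden_size)
        self.audio_proj = nn.Linear(audio_feat_dim, hidden_size)
        # Stand-in for the LLaMA-like Megrez-3B decoder.
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden_size, nhead=16, batch_first=True),
            num_layers=2,
        )

    def forward(self, text_embeds, image_feats=None, audio_feats=None):
        parts = []
        if image_feats is not None:
            parts.append(self.image_proj(image_feats))  # image tokens
        if audio_feats is not None:
            parts.append(self.audio_proj(audio_feats))  # audio tokens
        parts.append(text_embeds)                       # text tokens
        seq = torch.cat(parts, dim=1)                   # one unified sequence
        return self.decoder(seq)

# Toy usage with random features standing in for real encoder outputs.
model = OmniFusionSketch()
out = model(
    text_embeds=torch.randn(1, 16, 2048),
    image_feats=torch.randn(1, 256, 1152),
    audio_feats=torch.randn(1, 64, 1280),
)
print(out.shape)  # torch.Size([1, 336, 2048])
```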

Quick Start & Requirements

  • Install/Run: Use the Hugging Face transformers library (see the sketch after this list).
  • Prerequisites: PyTorch, transformers, flash_attention_2 (recommended). A CUDA-capable GPU is required for optimal performance.
  • Resources: Model weights are available on Hugging Face.
  • Docs: Infini-Megrez-Omni, HF Chat Demo
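A minimal load-and-chat sketch with transformers. The repo id, chat method, and message schema below are assumptions inferred from the description above; the authoritative interface is on the Hugging Face model card and in the Infini-Megrez-Omni docs.

```python
import torch
from transformers import AutoModelForCausalLM

# Assumed Hugging Face repo id; confirm the exact name on the model card.
MODEL_ID = "Infinigence/Megrez-3B-Omni"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,                   # the omni wrapper ships custom modeling code
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # recommended; drop if flash-attn is not installed
).eval().cuda()

# Hypothetical multimodal message; the real schema is defined by the model's
# custom chat interface (see the HF Chat Demo linked above).
messages = [{
    "role": "user",
    "content": {
        "text": "Describe this image.",
        "image": "path/to/image.jpg",
    },
}]

response = model.chat(messages, max_new_tokens=128)  # chat() is assumed from the custom code
print(response)
```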

Highlighted Details

  • Achieves top accuracy in image understanding benchmarks like OpenCompass, MME, MMMU, and OCRBench, surpassing larger models like LLaVA-NeXT-Yi-34B.
  • Maintains over 98% of its text-only performance on benchmarks like C-EVAL and MMLU (Pro).
  • Offers competitive inference speeds, with Megrez-3B-Omni achieving 1294.9 tokens/s decode speed on an H100.
  • Includes a WebSearch solution trained for automatic query generation and result summarization (see the sketch after this list).
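As a loose illustration of that WebSearch flow (referenced in the last bullet), the sketch below assumes the same chat-style interface as the Quick Start sketch plus a user-supplied search_web retrieval function; the repo's actual WebSearch solution defines its own prompts and tooling.

```python
def answer_with_websearch(model, question, search_web):
    """Hypothetical two-step flow: generate a search query, then summarize
    the fetched results. search_web is any callable returning text snippets."""
    # Step 1: automatic query generation.
    query = model.chat(
        [{"role": "user", "content": {"text": f"Write a web search query for: {question}"}}],
        max_new_tokens=32,
    )
    # Step 2: result summarization.
    results = search_web(query)  # e.g. titles and snippets from a search API
    prompt = f"Question: {question}\nSearch results:\n{results}\nAnswer using the results."
    return model.chat(
        [{"role": "user", "content": {"text": prompt}}],
        max_new_tokens=256,
    )
```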

Maintenance & Community

  • Developed by Infinigence AI.
  • Community links: WeChat Official Account, WeChat Groups (Chinese and English).

Licensing & Compatibility

  • Code is licensed under Apache-2.0.
  • Model usage is subject to a disclaimer covering inherent hallucination risks and potential unforeseen issues arising from the training data.

Limitations & Caveats

The authors acknowledge hallucination issues inherent to LLMs. For OCR tasks, disabling sampling (sampling=False) is recommended to reduce hallucinations, though it may introduce repetition. For best results, provide image inputs in the first conversation turn.
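A hedged sketch of that OCR recommendation, reusing the assumed chat interface and loaded model from the Quick Start sketch; only the sampling=False flag comes from the docs, while the prompt and file path are illustrative.

```python
# Greedy decoding for an OCR-style prompt: sampling=False reduces hallucinated
# characters but can cause repeated spans in long outputs.
ocr_messages = [{
    "role": "user",
    "content": {
        "text": "Transcribe all text in this image.",
        "image": "path/to/receipt.jpg",  # image provided in the first turn, per the note above
    },
}]
print(model.chat(ocr_messages, sampling=False, max_new_tokens=512))
```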

Health Check

  • Last commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 15 stars in the last 90 days
