Infini-Megrez by infinigence

AI model for edge-side intelligence, optimized for speed

created 10 months ago
325 stars

Top 85.0% on sourcepulse

Project Summary

Megrez-3B-Omni is a 3B-parameter multimodal LLM designed for efficient on-device inference, with text, image, and audio understanding. It aims to deliver state-of-the-art image comprehension while retaining strong language capabilities and adding audio processing, making it suitable for developers seeking a compact yet powerful AI solution.

How It Works

Megrez-3B-Omni extends the Megrez-3B-Instruct LLM with image understanding via SigLip-400M image tokens and audio understanding via the Qwen2-Audio/whisper-large-v3 encoder, so text, images, and audio are processed in a single unified model. It keeps a LLaMA-like structure for ease of deployment and relies on software-hardware collaborative optimization for high inference speed, claiming to outperform larger models on various benchmarks.
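As a rough illustration of that fusion design, the sketch below uses hypothetical projector modules and a stand-in decoder (the real layer names, dimensions, and encoders live in the Infini-Megrez-Omni code): each modality is encoded into token embeddings, projected into the LLM's hidden size, and concatenated with the text tokens before decoding.

```python
import torch
import torch.nn as nn

class OmniFusionSketch(nn.Module):
    """Hypothetical sketch of a Megrez-style omni model: modality features are
    projected into the LLM embedding space, then a single decoder consumes the
    concatenated token sequence."""

    def __init__(self, hidden_size=2048, image_feat_dim=1152, audio_feat_dim=1280):
        super().__init__()
        # Stand-ins for the SigLip-400M and Qwen2-Audio/whisper-large-v3 encoders' outputs.
        self.image_proj = nn.Linear(image_feat_dim, hidden_size)
        self.audio_proj = nn.Linear(audio_feat_dim, hidden_size)
        # Stand-in for the LLaMA-like Megrez-3B decoder.
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden_size, nhead=16, batch_first=True),
            num_layers=2,
        )

    def forward(self, text_embeds, image_feats=None, audio_feats=None):
        parts = []
        if image_feats is not None:
            parts.append(self.image_proj(image_feats))  # image tokens
        if audio_feats is not None:
            parts.append(self.audio_proj(audio_feats))  # audio tokens
        parts.append(text_embeds)                       # text tokens
        seq = torch.cat(parts, dim=1)                   # one unified sequence
        return self.decoder(seq)

# Toy usage with random features standing in for real encoder outputs.
model = OmniFusionSketch()
out = model(
    text_embeds=torch.randn(1, 16, 2048),
    image_feats=torch.randn(1, 256, 1152),
    audio_feats=torch.randn(1, 64, 1280),
)
print(out.shape)  # torch.Size([1, 336, 2048])
```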

Quick Start & Requirements

  • Install/Run: Use the Hugging Face transformers library (see the sketch after this list).
  • Prerequisites: PyTorch, transformers, flash_attention_2 (recommended). A CUDA-capable GPU is required for optimal performance.
  • Resources: Model weights are available on Hugging Face.
  • Docs: Infini-Megrez-Omni, HF Chat Demo
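A minimal load-and-chat sketch with transformers. The repo id, chat method, and message schema below are assumptions inferred from the description above; the authoritative interface is on the Hugging Face model card and in the Infini-Megrez-Omni docs.

```python
import torch
from transformers import AutoModelForCausalLM

# Assumed Hugging Face repo id; confirm the exact name on the model card.
MODEL_ID = "Infinigence/Megrez-3B-Omni"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,                   # the omni wrapper ships custom modeling code
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # recommended; drop if flash-attn is not installed
).eval().cuda()

# Hypothetical multimodal message; the real schema is defined by the model's
# custom chat interface (see the HF Chat Demo linked above).
messages = [{
    "role": "user",
    "content": {
        "text": "Describe this image.",
        "image": "path/to/image.jpg",
    },
}]

response = model.chat(messages, max_new_tokens=128)  # chat() is assumed from the custom code
print(response)
```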

Highlighted Details

  • Achieves top accuracy in image understanding benchmarks like OpenCompass, MME, MMMU, and OCRBench, surpassing larger models like LLaVA-NeXT-Yi-34B.
  • Maintains over 98% of its text-only performance on benchmarks like C-EVAL and MMLU (Pro).
  • Offers competitive inference speeds, with Megrez-3B-Omni achieving 1294.9 tokens/s decode speed on an H100.
  • Includes a WebSearch solution trained for automatic query generation and result summarization (see the sketch after this list).
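As a loose illustration of that WebSearch flow (referenced in the last bullet), the sketch below assumes the same chat-style interface as the Quick Start sketch plus a user-supplied search_web retrieval function; the repo's actual WebSearch solution defines its own prompts and tooling.

```python
def answer_with_websearch(model, question, search_web):
    """Hypothetical two-step flow: generate a search query, then summarize
    the fetched results. search_web is any callable returning text snippets."""
    # Step 1: automatic query generation.
    query = model.chat(
        [{"role": "user", "content": {"text": f"Write a web search query for: {question}"}}],
        max_new_tokens=32,
    )
    # Step 2: result summarization.
    results = search_web(query)  # e.g. titles and snippets from a search API
    prompt = f"Question: {question}\nSearch results:\n{results}\nAnswer using the results."
    return model.chat(
        [{"role": "user", "content": {"text": prompt}}],
        max_new_tokens=256,
    )
```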

Maintenance & Community

  • Developed by Infinigence AI.
  • Community links: WeChat Official Account, WeChat Groups (Chinese and English).

Licensing & Compatibility

  • Code is licensed under Apache-2.0.
  • Model usage is subject to a disclaimer covering inherent hallucination risks and potential unforeseen issues arising from the training data.

Limitations & Caveats

The authors acknowledge hallucination issues inherent to LLMs. For OCR tasks, disabling sampling (sampling=False) is recommended to reduce hallucinations, though it may introduce repetition. For best results, provide image inputs in the first conversation turn.
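A hedged sketch of that OCR recommendation, reusing the assumed chat interface and loaded model from the Quick Start sketch; only the sampling=False flag comes from the docs, while the prompt and file path are illustrative.

```python
# Greedy decoding for an OCR-style prompt: sampling=False reduces hallucinated
# characters but can cause repeated spans in long outputs.
ocr_messages = [{
    "role": "user",
    "content": {
        "text": "Transcribe all text in this image.",
        "image": "path/to/receipt.jpg",  # image provided in the first turn, per the note above
    },
}]
print(model.chat(ocr_messages, sampling=False, max_new_tokens=512))
```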

Health Check

  • Last commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 15 stars in the last 90 days
