MGM-Omni by dvlab-research

Omni-modal LLM for personalized long-horizon speech and multi-input understanding

Created 2 months ago
252 stars

Top 99.6% on SourcePulse

View on GitHub
Project Summary

MGM-Omni: Scaling Omni LLMs to Personalized Long-Horizon Speech

MGM-Omni addresses the challenge of creating versatile AI assistants capable of processing and generating long-form speech across multiple modalities. Targeting researchers and developers, it offers a unified framework for understanding text, image, video, and speech, with the unique ability to generate extended, natural-sounding speech and clone voices. This project aims to significantly advance the capabilities of open-source multi-modal large language models.

How It Works

The architecture leverages modality-specific encoders to process diverse inputs, feeding features into a Multi-modal Large Language Model (MLLM). The MLLM's text output is then directed to a SpeechLM component, which employs a novel Chunk-Based Parallel Decoding strategy for efficient speech token generation. These tokens are converted into mel-spectrograms via a flow matching model and finally synthesized into audio using a vocoder. This approach enables robust handling of hour-long audio inputs and facilitates smooth, long-form speech generation.
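As a rough illustration of that flow, the stages can be sketched as below. This is not the project's actual API: every function, shape, and token count is a made-up stand-in, and the point is only the data path (encoders → MLLM → chunked speech tokens → mel frames → waveform) and the fact that chunked decoding lets audio stream out before the full utterance is finished.

```python
# Conceptual sketch of the described pipeline (NOT MGM-Omni's real code).
# All names, shapes, and vocabulary sizes below are illustrative stand-ins.
import numpy as np

def encode_inputs(text, image=None, audio=None):
    """Stand-in for the modality-specific encoders feeding the MLLM."""
    # Each modality has its own encoder in the real system; here we just
    # return a dummy feature sequence for the text prompt.
    return np.random.randn(len(text), 1024)

def mllm_generate_text(features):
    """Stand-in for the MLLM producing a text response."""
    return "hello from the assistant"

def speechlm_decode_chunked(text, chunk_size=8, num_chunks=4):
    """Chunk-based parallel decoding, schematically: speech tokens are
    emitted one chunk at a time, with the tokens inside a chunk predicted
    in parallel, so downstream synthesis can start before decoding ends."""
    for _ in range(num_chunks):
        # One forward pass yields `chunk_size` speech tokens at once.
        yield np.random.randint(0, 4096, size=chunk_size)

def flow_matching_to_mel(speech_tokens):
    """Stand-in for the flow matching model mapping tokens to mel frames."""
    return np.random.randn(len(speech_tokens) * 4, 80)  # (frames, mel bins)

def vocoder(mel):
    """Stand-in vocoder turning mel frames into waveform samples."""
    return np.random.randn(mel.shape[0] * 256)           # 256 samples/frame

features = encode_inputs("Describe this image and read it aloud.")
reply = mllm_generate_text(features)
for chunk in speechlm_decode_chunked(reply):
    audio = vocoder(flow_matching_to_mel(chunk))          # stream per chunk
    print(f"streamed {audio.shape[0]} audio samples")
```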

Quick Start & Requirements

Installation involves cloning the repository, setting up a Conda environment with Python 3.10, initializing Git submodules, and installing the package (pip install -e .). Key resources include the technical report, a blog post detailing the project, Hugging Face links for models and demos, and a benchmark dataset for long-form TTS evaluation.
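Put together, a typical setup based on that description might look like the following; the repository URL and environment name are assumptions rather than values quoted from the README.

```bash
# Illustrative install sequence for the steps described above.
git clone https://github.com/dvlab-research/MGM-Omni.git
cd MGM-Omni
conda create -n mgm-omni python=3.10 -y
conda activate mgm-omni
git submodule update --init --recursive
pip install -e .
```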

Highlighted Details

  • Omni-modality: Supports text, image, video, and audio inputs.
  • Long-form Speech: Capable of understanding hour-long audio and generating over 10 minutes of continuous speech.
  • Personalization: Features zero-shot voice cloning from short audio clips (~10 seconds).
  • Efficiency: Enables streaming audio generation through parallel decoding.
  • Performance: Demonstrates competitive results on speech understanding and generation benchmarks, outperforming several existing models.

Maintenance & Community

The project is presented as fully open-source, with models and code available on Hugging Face. While specific community channels like Discord or Slack are not detailed, the Hugging Face platform serves as a hub for models and discussions. The project acknowledges significant contributions from related open-source efforts.

Licensing & Compatibility

The README declares the project "Fully Open-source" but does not name a specific license (e.g., MIT, Apache). Without an explicit license, it is difficult to assess commercial-use compatibility and potential adoption restrictions.

Limitations & Caveats

Training and fine-tuning code are explicitly marked as "TODO" and have not yet been released, limiting users to inference for now. The README also lists future dates for some model releases, which warrants caution about the project's current development stage and stability.

Health Check

  • Last Commit: 3 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 94 stars in the last 30 days
