dvlab-research: Omni-modal LLM for personalized long-horizon speech and multi-input understanding
MGM-Omni: Scaling Omni LLMs to Personalized Long-Horizon Speech
MGM-Omni addresses the challenge of creating versatile AI assistants capable of processing and generating long-form speech across multiple modalities. Targeting researchers and developers, it offers a unified framework for understanding text, image, video, and speech, with the unique ability to generate extended, natural-sounding speech and clone voices. This project aims to significantly advance the capabilities of open-source multi-modal large language models.
How It Works
The architecture uses modality-specific encoders to process diverse inputs, feeding their features into a Multi-modal Large Language Model (MLLM). The MLLM's text output is then passed to a SpeechLM component, which employs a chunk-based parallel decoding strategy for efficient speech-token generation. These tokens are converted into mel-spectrograms by a flow-matching model and finally synthesized into audio by a vocoder. This design supports robust handling of hour-long audio inputs and smooth, long-form speech generation.
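To make the decoding step concrete, here is a minimal toy sketch of the idea behind chunk-based parallel decoding: instead of emitting one speech token per autoregressive step, each step emits a whole chunk, reducing the number of sequential steps. The function name, token format, and chunk size are illustrative assumptions, not the model's actual implementation.

```python
def chunked_parallel_decode(text_tokens, chunk_size=4):
    """Toy illustration: each sequential step emits chunk_size speech
    tokens for one text token, so a sequence of N text tokens takes
    N steps instead of N * chunk_size steps."""
    speech_tokens = []
    steps = 0
    for tok in text_tokens:
        # one "parallel" step yields a whole chunk of speech tokens
        speech_tokens.extend(f"{tok}/s{i}" for i in range(chunk_size))
        steps += 1
    return speech_tokens, steps

tokens, steps = chunked_parallel_decode(["hel", "lo"], chunk_size=4)
# 2 sequential steps produce 8 speech tokens
```

In the real system these chunks would feed the flow-matching model to produce mel-spectrograms; the sketch only shows why chunking shortens the sequential decoding path for long speech.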
Quick Start & Requirements
Installation involves cloning the repository, setting up a Conda environment with Python 3.10, initializing Git submodules, and installing the package (pip install -e .). Key resources include the technical report, a blog post detailing the project, Hugging Face links for models and demos, and a benchmark dataset for long-form TTS evaluation.
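The steps described above might look like the following; the repository URL and environment name are assumptions inferred from the project name, so check the project page for the canonical values.

```shell
# assumed repository location; verify against the official project page
git clone https://github.com/dvlab-research/MGM-Omni.git
cd MGM-Omni
git submodule update --init --recursive

# Conda environment with Python 3.10, as the README specifies
conda create -n mgm-omni python=3.10 -y
conda activate mgm-omni

# editable install of the package
pip install -e .
```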
Maintenance & Community
The project is presented as fully open-source, with models and code available on Hugging Face. While specific community channels like Discord or Slack are not detailed, the Hugging Face platform serves as a hub for models and discussions. The project acknowledges significant contributions from related open-source efforts.
Licensing & Compatibility
The README declares the project "Fully Open-source" but names no specific license (e.g., MIT, Apache-2.0). Without an explicit license, commercial-use compatibility and redistribution terms cannot be assessed, which may block adoption.
Limitations & Caveats
Training and fine-tuning code are explicitly marked as "TODO" and not yet released, limiting users to inference for now. The README's use of future dates for model releases suggests the project is still early in development, so its current stability should be judged cautiously.