jingyaogong
A 0.1B Omni model for multimodal AI
Top 25.8% on SourcePulse
Summary
This project addresses the scarcity of small, end-to-end Omni models trainable from scratch, targeting researchers and developers who need a transparent, lightweight baseline for multimodal AI with integrated speech capabilities. It offers a practical path to understanding, training, and modifying full Omni systems using consumer-grade hardware.
How It Works
The architecture features a Thinker-Talker dual-path design. The Thinker processes text, audio, and vision inputs, generating semantic representations. The Talker directly synthesizes streaming speech via Multi-Token Prediction (MTP) of Mimi codes, integrating speech at the hidden state level. This approach bypasses cascaded ASR-LLM-TTS pipelines, aiming for reduced latency and improved naturalness.
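The data flow described above can be illustrated with a toy sketch. This is not the project's code: the function names (`thinker`, `talker_step`, `stream_codes`), the random weights, and the sizes are all hypothetical stand-ins, and "multi-token prediction" is read here as predicting one code per codebook from a single hidden state in one step. It only shows how a Talker could emit a frame of audio codes for each hidden state as soon as the Thinker produces it, without an intermediate text-to-speech stage.

```python
import random

# Toy sizes for illustration only (assumptions, not the project's real configuration).
HIDDEN_DIM = 8
NUM_CODEBOOKS = 4
CODEBOOK_SIZE = 16


def make_heads(rng):
    """Build one linear head per codebook: CODEBOOK_SIZE rows of HIDDEN_DIM weights."""
    return [
        [[rng.uniform(-1, 1) for _ in range(HIDDEN_DIM)] for _ in range(CODEBOOK_SIZE)]
        for _ in range(NUM_CODEBOOKS)
    ]


def thinker(tokens, rng):
    """Stand-in for the Thinker: emit one hidden-state vector per input token."""
    for _ in tokens:
        yield [rng.uniform(-1, 1) for _ in range(HIDDEN_DIM)]


def talker_step(hidden, heads):
    """Predict one code per codebook from a single hidden state (greedy argmax)."""
    frame = []
    for head in heads:
        logits = [sum(w * h for w, h in zip(row, hidden)) for row in head]
        frame.append(max(range(CODEBOOK_SIZE), key=logits.__getitem__))
    return frame


def stream_codes(tokens, seed=0):
    """Yield one frame of audio codes per hidden state, as soon as it is available."""
    rng = random.Random(seed)
    heads = make_heads(rng)
    for hidden in thinker(tokens, rng):
        yield talker_step(hidden, heads)


if __name__ == "__main__":
    for frame in stream_codes(["hello", "world"]):
        print(frame)
```

Because `stream_codes` is a generator, a consumer can start decoding audio from the first frame while later frames are still being produced, which is the latency argument made for skipping a cascaded pipeline.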
Quick Start & Requirements
Clone the repository (git clone --depth 1 https://github.com/jingyaogong/minimind-o) and install dependencies (pip install -r requirements.txt).
[...] modelscope download.
Highlighted Details
[...] minimind-3o), positioning it as one of the smallest complete Omni implementations available.
Maintenance & Community
The project is primarily community-driven through GitHub issues and pull requests. No dedicated community channels (e.g., Discord, Slack) or public roadmap are listed.
Licensing & Compatibility
Licensed under the Apache-2.0 License, which permits commercial use and integration into closed-source projects.
Limitations & Caveats
The ~0.1B model exhibits limitations in complex reasoning, knowledge recall, and open-ended English generation compared to larger models. Voice cloning is described as a beta feature with variable consistency across prompts and sentence lengths. Barge-in functionality relies on basic Voice Activity Detection (VAD) thresholds rather than semantic interruption. Chinese speech handling is noted as more challenging than English.
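To make the barge-in caveat concrete, here is a minimal sketch of threshold-based interruption detection. It is an assumption about what "basic VAD thresholds" could look like, not the project's implementation: the function names, the RMS energy measure, and the `threshold` and `min_frames` values are all hypothetical. The point is that such a detector reacts to loudness alone and cannot tell a real interruption from a cough or background noise.

```python
import math


def rms(frame):
    """Root-mean-square energy of one audio frame (a list of float samples)."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def detect_barge_in(frames, threshold=0.1, min_frames=3):
    """Return the index of the frame at which barge-in triggers, or None.

    Barge-in fires once `min_frames` consecutive frames reach the energy
    threshold. Both values are illustrative assumptions.
    """
    run = 0
    for i, frame in enumerate(frames):
        if rms(frame) >= threshold:
            run += 1
            if run >= min_frames:
                return i
        else:
            run = 0  # a quiet frame resets the count
    return None


if __name__ == "__main__":
    silence = [0.0] * 160
    loud = [0.5] * 160
    print(detect_barge_in([silence, loud, loud, silence, loud, loud, loud]))
```

Requiring several consecutive loud frames filters out short clicks, but anything sustained and loud enough still triggers an interruption regardless of its content.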