minimind-o  by jingyaogong

A 0.1B Omni model for multimodal AI

Created 3 weeks ago

1,589 stars

Top 25.8% on SourcePulse

Project Summary

This project addresses the scarcity of small, end-to-end Omni models trainable from scratch, targeting researchers and developers who need a transparent, lightweight baseline for multimodal AI with integrated speech capabilities. It offers a practical path to understanding, training, and modifying full Omni systems using consumer-grade hardware.

How It Works

The architecture uses a Thinker-Talker dual-path design. The Thinker processes text, audio, and vision inputs and produces semantic representations. The Talker synthesizes streaming speech directly from those representations via Multi-Token Prediction (MTP) of Mimi codes, so speech is integrated at the hidden-state level. This approach bypasses cascaded ASR-LLM-TTS pipelines, aiming for reduced latency and improved naturalness.
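
One way to picture multi-token prediction over codec tokens is a set of output heads that all read the same hidden state, each emitting one code index for the current audio frame. The pure-Python toy below is a sketch under that assumption only. The head count, vocabulary size, hidden size, random weights, and the names make_head, head_logits, and predict_frame_codes are hypothetical and not taken from the repository, which implements its model in PyTorch.

```python
import random


def make_head(hidden_size, vocab_size, rng):
    """Random linear head (toy stand-in for a trained layer): vocab_size rows of weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(hidden_size)]
            for _ in range(vocab_size)]


def head_logits(head, hidden):
    """Dot product of every weight row with the hidden state."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in head]


def predict_frame_codes(heads, hidden):
    """Greedy decoding: every head picks one code index from the same hidden state."""
    codes = []
    for head in heads:
        logits = head_logits(head, hidden)
        codes.append(max(range(len(logits)), key=logits.__getitem__))
    return codes


# Hypothetical sizes chosen only to keep the toy small.
HIDDEN_SIZE, VOCAB_SIZE, NUM_CODEBOOKS = 16, 32, 8

rng = random.Random(0)
heads = [make_head(HIDDEN_SIZE, VOCAB_SIZE, rng) for _ in range(NUM_CODEBOOKS)]
hidden = [rng.uniform(-1.0, 1.0) for _ in range(HIDDEN_SIZE)]

frame_codes = predict_frame_codes(heads, hidden)
print(frame_codes)  # one code index per (hypothetical) codebook
```

The point of the toy is that one forward step yields several codec tokens at once, rather than one token per step as in plain next-token decoding.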

Quick Start & Requirements

  • Install: Clone the repository (git clone --depth 1 https://github.com/jingyaogong/minimind-o) and install dependencies (pip install -r requirements.txt).
  • Prerequisites: Python 3.10, CUDA 12.2, and an NVIDIA GPU (RTX 3090 recommended for training). External model components (SenseVoice, SigLIP2, Mimi, CAM++, base LLM) must be downloaded separately via modelscope download.
  • Resource Footprint: Training on the 'mini' dataset completes in approximately 2 hours on a single RTX 3090. CPU inference is supported.
  • Links: Technical Report (arXiv:2605.03937), Online Demo (Gradio), Video Introduction, Model Collections (ModelScope, HuggingFace).
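
Before cloning and installing, it can help to check the local machine against the prerequisites listed above. The script below is a minimal best-effort sketch using only the Python standard library. The function name preflight and the choice of checks are illustrative assumptions; it does not verify the CUDA version or the GPU model, and finding nvidia-smi on the PATH only suggests that an NVIDIA driver is installed.

```python
import shutil
import sys

REQUIRED_PYTHON = (3, 10)  # taken from the prerequisites listed above


def preflight():
    """Best-effort report on whether the local setup matches the listed prerequisites."""
    return {
        "python_version": "%d.%d" % sys.version_info[:2],
        "python_matches": sys.version_info[:2] == REQUIRED_PYTHON,
        # Presence of the tool on PATH, not a guarantee of a working GPU.
        "nvidia_smi_found": shutil.which("nvidia-smi") is not None,
        "git_found": shutil.which("git") is not None,
    }


report = preflight()
for key, value in report.items():
    print(f"{key}: {value}")
```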

Highlighted Details

  • Ultra-Lightweight: Features approximately 0.1B trainable parameters (minimind-3o), positioning it as one of the smallest complete Omni implementations available.
  • End-to-End Training: Provides a full supervised fine-tuning (SFT) pipeline for Text-to-Audio (T2A), Image-to-Text (I2T), and Audio-to-Audio (A2A) tasks.
  • Real-time Speech: Supports 24 kHz streaming audio output, real-time barge-in interruption, and approximate full-duplex interaction.
  • In-Context Voice Cloning: Enables voice cloning using reference audio prompts, controllable via a WebUI that includes a phone mode.
  • Native PyTorch Implementation: Core algorithms are implemented from scratch in native PyTorch, avoiding reliance on high-level framework abstractions.
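
The 24 kHz streaming output can be related to codec frames with some simple arithmetic. The sketch below assumes the Mimi codec's published frame rate of 12.5 Hz; whether this project runs Mimi at that rate is an assumption, and the helper names are hypothetical.

```python
import math

SAMPLE_RATE_HZ = 24_000  # output sample rate stated in the summary above
FRAME_RATE_HZ = 12.5     # assumption: Mimi's published codec frame rate


def samples_per_frame(sample_rate_hz=SAMPLE_RATE_HZ, frame_rate_hz=FRAME_RATE_HZ):
    """Audio samples covered by one codec frame."""
    return int(round(sample_rate_hz / frame_rate_hz))


def frame_duration_ms(frame_rate_hz=FRAME_RATE_HZ):
    """Duration of one codec frame in milliseconds."""
    return 1000.0 / frame_rate_hz


def frames_for_seconds(seconds, frame_rate_hz=FRAME_RATE_HZ):
    """Codec frames needed to cover the given duration of speech."""
    return math.ceil(seconds * frame_rate_hz)


print(samples_per_frame())      # 1920 samples per frame
print(frame_duration_ms())      # 80.0 ms per frame
print(frames_for_seconds(2.0))  # 25 frames for two seconds of audio
```

Under these assumptions, each decoding step that emits one frame of codes accounts for 80 ms of audio, which sets a floor on how finely the stream can be chunked.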

Maintenance & Community

Community activity runs primarily through GitHub issues and pull requests. No dedicated community channels (e.g., Discord, Slack) or public roadmap are listed.

Licensing & Compatibility

Licensed under the Apache-2.0 License, which permits commercial use and integration into closed-source projects.

Limitations & Caveats

The ~0.1B model exhibits limitations in complex reasoning, knowledge recall, and open-ended English generation compared to larger models. Voice cloning is described as a beta feature with variable consistency across prompts and sentence lengths. Barge-in functionality relies on basic Voice Activity Detection (VAD) thresholds rather than semantic interruption. Chinese speech handling is noted as more challenging than English.
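
To illustrate what threshold-based barge-in means in practice, here is a minimal energy-based sketch. It is not the project's implementation: the RMS measure, the threshold of 0.05, the three-frame requirement, and the names frame_rms and should_interrupt are all assumptions chosen for the example.

```python
import math


def frame_rms(frame):
    """Root-mean-square energy of one frame of float samples in [-1.0, 1.0]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def should_interrupt(frames, rms_threshold=0.05, min_consecutive=3):
    """Return True once min_consecutive frames in a row exceed rms_threshold."""
    run = 0
    for frame in frames:
        if frame_rms(frame) > rms_threshold:
            run += 1
            if run >= min_consecutive:
                return True
        else:
            run = 0  # a quiet frame resets the streak
    return False


silence = [0.0] * 160
speech = [0.2, -0.2] * 80  # constant-amplitude toy signal, RMS 0.2

print(should_interrupt([silence, speech, speech, speech]))  # True
print(should_interrupt([speech, silence, speech, silence]))  # False
```

A detector of this kind reacts to any sufficiently loud sound, so a cough triggers it just as readily as the word "stop", which is the gap between threshold-based and semantic interruption noted above.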

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 3
  • Issues (30d): 7
  • Star History: 1,600 stars in the last 26 days
