minimind-o by jingyaogong

A 0.1B Omni model for multimodal AI

Created 2 months ago

2,065 stars

Top 20.9% on SourcePulse

Project Summary

This project addresses the scarcity of small, end-to-end Omni models trainable from scratch, targeting researchers and developers who need a transparent, lightweight baseline for multimodal AI with integrated speech capabilities. It offers a practical path to understanding, training, and modifying full Omni systems using consumer-grade hardware.

How It Works

The architecture features a Thinker-Talker dual-path design. The Thinker processes text, audio, and vision inputs, generating semantic representations. The Talker directly synthesizes streaming speech via Multi-Token Prediction (MTP) of Mimi codes, integrating speech at the hidden state level. This approach bypasses cascaded ASR-LLM-TTS pipelines, aiming for reduced latency and improved naturalness.
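As a rough illustration of the Talker side, the sketch below shows one common reading of multi-token prediction for codec audio: a single hidden state is mapped to several codebook indices in one step, one per codebook, rather than emitting them one at a time. It is a toy in pure Python, not the project's implementation. The names `make_heads`, `predict_frame`, and `stream_codes`, the codebook count of 8, and the tiny dimensions are all assumptions chosen for readability.

```python
import random

NUM_CODEBOOKS = 8    # assumption: number of codec codebooks per audio frame
CODEBOOK_SIZE = 16   # toy value; real codecs use much larger codebooks
HIDDEN_DIM = 4       # toy hidden width


def make_heads(rng):
    """Build one linear head (CODEBOOK_SIZE x HIDDEN_DIM weights) per codebook."""
    return [
        [[rng.uniform(-1.0, 1.0) for _ in range(HIDDEN_DIM)]
         for _ in range(CODEBOOK_SIZE)]
        for _ in range(NUM_CODEBOOKS)
    ]


def predict_frame(hidden, heads):
    """Map one hidden state to NUM_CODEBOOKS code indices in a single step."""
    codes = []
    for head in heads:
        # Linear projection of the hidden state onto each codebook entry.
        logits = [sum(w * h for w, h in zip(row, hidden)) for row in head]
        # Greedy choice: index of the largest logit.
        codes.append(max(range(CODEBOOK_SIZE), key=logits.__getitem__))
    return codes


def stream_codes(hidden_states, heads):
    """Yield one frame of codes per hidden state, as a streaming decoder might."""
    for hidden in hidden_states:
        yield predict_frame(hidden, heads)


if __name__ == "__main__":
    heads = make_heads(random.Random(0))
    for frame in stream_codes([[0.1, 0.2, 0.3, 0.4], [1.0, 0.0, 0.0, 0.0]], heads):
        print(frame)
```

Because every frame of codes comes from hidden states rather than from decoded text, a codec decoder could start producing audio as soon as the first frame is available, which is the latency argument made above.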

Quick Start & Requirements

Install: Clone the repository (git clone --depth 1 https://github.com/jingyaogong/minimind-o) and install dependencies (pip install -r requirements.txt).
Prerequisites: Python 3.10, CUDA 12.2, and an NVIDIA GPU (RTX 3090 recommended for training). External model components (SenseVoice, SigLIP2, Mimi, CAM++, base LLM) must be downloaded separately via modelscope download.
Resource Footprint: Training on the 'mini' dataset takes approximately 2 hours on a single RTX 3090. CPU inference is supported.
Links: Technical Report (arXiv:2605.03937), Online Demo (Gradio), Video Introduction, Model Collections (ModelScope, HuggingFace).

Highlighted Details

Ultra-Lightweight: Features approximately 0.1B trainable parameters (minimind-3o), positioning it as one of the smallest complete Omni implementations available.
End-to-End Training: Provides a full supervised fine-tuning (SFT) pipeline for Text-to-Audio (T2A), Image-to-Text (I2T), and Audio-to-Audio (A2A) tasks.
Real-time Speech: Supports 24 kHz streaming audio output, real-time barge-in interruption, and approximate full-duplex interaction.
In-Context Voice Cloning: Enables voice cloning using reference audio prompts, controllable via a WebUI that includes a phone mode.
Native PyTorch Implementation: Core algorithms are implemented from scratch in native PyTorch, avoiding reliance on high-level framework abstractions.
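To make the 24 kHz streaming figure concrete, the arithmetic below relates sample rate to codec frames. The 24 kHz value comes from the summary; the 12.5 Hz frame rate is Mimi's published figure and is an assumption here, since this summary does not state the configuration the project uses. The function names are hypothetical.

```python
import math

SAMPLE_RATE_HZ = 24_000  # from the summary: 24 kHz streaming output
FRAME_RATE_HZ = 12.5     # assumption: Mimi's published frame rate


def samples_per_frame(sample_rate_hz=SAMPLE_RATE_HZ, frame_rate_hz=FRAME_RATE_HZ):
    """Number of PCM samples covered by one codec frame."""
    return int(round(sample_rate_hz / frame_rate_hz))


def frame_duration_ms(frame_rate_hz=FRAME_RATE_HZ):
    """Duration of one codec frame in milliseconds."""
    return 1000.0 / frame_rate_hz


def frames_needed(seconds, frame_rate_hz=FRAME_RATE_HZ):
    """Codec frames required to cover a given duration of audio."""
    return math.ceil(seconds * frame_rate_hz)


if __name__ == "__main__":
    print(samples_per_frame())   # 1920
    print(frame_duration_ms())   # 80.0
    print(frames_needed(2.0))    # 25
```

Under these assumptions each generated frame covers 80 ms of audio, which sets a lower bound on the granularity of streaming playback.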

Maintenance & Community

Development and support run primarily through GitHub issues and pull requests. No dedicated community channels (e.g., Discord, Slack) or public roadmap are listed.

Licensing & Compatibility

Licensed under the Apache-2.0 License, which permits commercial use and integration into closed-source projects, provided the license's attribution and notice requirements are met.

Limitations & Caveats

The ~0.1B model exhibits limitations in complex reasoning, knowledge recall, and open-ended English generation compared to larger models. Voice cloning is described as a beta feature with variable consistency across prompts and sentence lengths. Barge-in functionality relies on basic Voice Activity Detection (VAD) thresholds rather than semantic interruption. Chinese speech handling is noted as more challenging than English.
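The barge-in caveat can be illustrated with a minimal energy-threshold detector. This is a generic sketch of threshold-based VAD, not the project's code: the class name `BargeInDetector`, the RMS threshold of 0.05, and the three-frame hold are all hypothetical. It also shows why such a detector is not semantic: any sufficiently loud sound triggers it, regardless of what is said.

```python
import math


def rms(frame):
    """Root-mean-square energy of a frame of float samples in [-1.0, 1.0]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


class BargeInDetector:
    """Signal an interruption after several consecutive loud frames."""

    def __init__(self, threshold=0.05, min_frames=3):
        self.threshold = threshold    # hypothetical RMS threshold
        self.min_frames = min_frames  # consecutive loud frames required
        self._run = 0                 # current streak of loud frames

    def update(self, frame):
        """Feed one microphone frame; return True if playback should stop."""
        if rms(frame) >= self.threshold:
            self._run += 1
        else:
            self._run = 0  # a quiet frame resets the streak
        return self._run >= self.min_frames


if __name__ == "__main__":
    detector = BargeInDetector()
    quiet = [0.0] * 160
    loud = [0.5] * 160
    for frame in (quiet, loud, loud, loud, quiet):
        print(detector.update(frame))
```

Requiring a short run of loud frames filters out isolated clicks, but a cough or background noise that lasts long enough would still interrupt playback.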
