GPT-4o-like multimodal chatbot
Top 80.6% on SourcePulse
Stream-Omni is an open-source multimodal chatbot designed for simultaneous interaction across text, vision, and speech modalities, mimicking GPT-4o's capabilities. It targets researchers and developers building advanced conversational AI systems, offering a unified framework for complex, multi-input/output interactions.
How It Works
Stream-Omni achieves multimodal alignment through sequence-dimension concatenation for vision-text and layer-dimension mapping for speech-text. This approach enables a seamless "see-while-hear" experience by simultaneously outputting intermediate textual results (like ASR transcriptions) during speech interactions, alongside the final text and speech responses.
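The two alignment styles named above can be illustrated with a toy sketch. This is not Stream-Omni's code: the helper names (concat_sequence_dim, stack_layer_dim, add_bias) are hypothetical, the "layers" are stand-in functions, and placing speech layers below and above the LLM layers is an assumed reading of "layer-dimension mapping". It only contrasts growing the sequence (vision-text) with growing the depth (speech-text).

```python
from typing import Callable, List

Vector = List[float]
Sequence = List[Vector]                 # shape: (sequence_length, hidden_size)
Layer = Callable[[Sequence], Sequence]  # toy stand-in for a transformer layer


def concat_sequence_dim(vision: Sequence, text: Sequence) -> Sequence:
    """Join vision and text embeddings along the sequence axis."""
    widths = {len(vec) for vec in vision + text}
    if len(widths) > 1:
        raise ValueError("all embeddings must share one hidden size")
    return vision + text


def stack_layer_dim(speech_bottom: List[Layer],
                    llm: List[Layer],
                    speech_top: List[Layer]) -> List[Layer]:
    """Arrange speech layers around the LLM layers along the depth axis.

    Assumption: bottom/top placement is illustrative, not confirmed by the source.
    """
    return [*speech_bottom, *llm, *speech_top]


def run(layers: List[Layer], hidden: Sequence) -> Sequence:
    """Apply the layers in order, bottom to top."""
    for layer in layers:
        hidden = layer(hidden)
    return hidden


def add_bias(bias: float) -> Layer:
    """Toy layer: adds a constant to every value."""
    return lambda seq: [[x + bias for x in vec] for vec in seq]


vision_tokens = [[0.0, 0.0], [1.0, 1.0]]            # 2 vision tokens
text_tokens = [[2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]  # 3 text tokens
fused = concat_sequence_dim(vision_tokens, text_tokens)  # sequence grows to 5

stack = stack_layer_dim([add_bias(1.0)],
                        [add_bias(10.0), add_bias(10.0)],
                        [add_bias(100.0)])               # depth grows to 4
output = run(stack, fused)
```

In this sketch the vision path changes only the sequence length, while the speech path changes only the number of layers, so the two modalities extend the model along different axes.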
Quick Start & Requirements
Requires flash-attn and the other dependencies listed in requirements.txt. A Conda environment is recommended.
A --load-8bit option is available for VRAM < 32GB.
Highlighted Details
Maintenance & Community
./scripts/stream_omni/
Licensing & Compatibility
Limitations & Caveats
The README does not specify the exact license for Stream-Omni itself, which could impact commercial use. It also relies on external models like CosyVoice, whose compatibility and licensing must be independently verified.
3 months ago
Inactive