Stream-Omni by ictnlp

GPT-4o-like multimodal chatbot

Created 8 months ago

384 stars

Top 74.7% on SourcePulse

Project Summary

Stream-Omni is an open-source multimodal chatbot designed for simultaneous interaction across text, vision, and speech modalities, mimicking GPT-4o's capabilities. It targets researchers and developers building advanced conversational AI systems, offering a unified framework for complex, multi-input/output interactions.

How It Works

Stream-Omni aligns modalities in two ways: sequence-dimension concatenation for vision-text and layer-dimension mapping for speech-text. During speech interactions, the model outputs intermediate textual results (such as ASR transcriptions) alongside the final text and speech responses, giving users a "see-while-hear" experience.
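The two alignment styles named above can be contrasted with a toy sketch. This is an illustration only, not Stream-Omni's code: the function names, the placement of vision tokens before text tokens, the bottom/core/top arrangement of layers, and the `scale` stand-in for a model layer are all assumptions made for the example. The point is that sequence-dimension alignment makes the input longer, while layer-dimension alignment makes the model deeper and leaves the sequence length unchanged.

```python
# Toy illustration only; not Stream-Omni's actual implementation.
from typing import Callable, List

Vector = List[float]
Sequence = List[Vector]
Layer = Callable[[Sequence], Sequence]


def concat_along_sequence(vision: Sequence, text: Sequence) -> Sequence:
    """Sequence-dimension alignment: join the two token sequences end to end.

    Placing vision tokens first is an assumption made for this sketch.
    """
    return vision + text


def stack_along_layers(bottom: List[Layer], core: List[Layer], top: List[Layer]) -> Layer:
    """Layer-dimension alignment: extra layers wrap a core stack.

    The bottom/core/top split is hypothetical; the sequence length is unchanged.
    """
    def forward(seq: Sequence) -> Sequence:
        for layer in bottom + core + top:
            seq = layer(seq)
        return seq
    return forward


def scale(factor: float) -> Layer:
    """Hypothetical stand-in for a model layer: multiplies every value by `factor`."""
    return lambda seq: [[v * factor for v in vec] for vec in seq]


# Vision-text path: 3 vision tokens + 2 text tokens give a sequence of 5.
vision = [[1.0, 1.0] for _ in range(3)]
text = [[2.0, 2.0] for _ in range(2)]
joined = concat_along_sequence(vision, text)

# Speech-text path: 4 speech frames pass through 3 stacked layers and stay 4 long.
speech = [[1.0, 1.0] for _ in range(4)]
model = stack_along_layers([scale(2.0)], [scale(3.0)], [scale(0.5)])
out = model(speech)

print(len(joined), len(out))
```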

Quick Start & Requirements

Installation: Requires Python 3.10, flash-attn, and other dependencies listed in requirements.txt. A Conda environment is recommended.
Models: Download Stream-Omni checkpoints and CosyVoice models.
Launch: Run controller, worker, and Gradio interface scripts.
Hardware: Use the --load-8bit option when GPU VRAM is below 32 GB.
Demo: Vision-grounded Speech Interaction Demo
API: api.py
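The requirements above can be checked before launching with a small pre-flight script. This is a sketch under stated assumptions: the helper names are made up for illustration, the 32 GB threshold is taken from the hardware note, and the script does not assume which launch script accepts --load-8bit. VRAM detection uses PyTorch if it is installed and returns nothing otherwise.

```python
import sys

VRAM_THRESHOLD_GB = 32  # threshold taken from the hardware note above


def python_ok(version=sys.version_info) -> bool:
    """Return True when the interpreter is Python 3.10, as the install notes require."""
    return tuple(version[:2]) == (3, 10)


def extra_flags(vram_gb: float) -> list:
    """Suggest --load-8bit when available VRAM is below the threshold."""
    return ["--load-8bit"] if vram_gb < VRAM_THRESHOLD_GB else []


def detect_vram_gb():
    """Return the total memory of GPU 0 in GiB, or None if PyTorch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_properties(0).total_memory / 1024**3


if __name__ == "__main__":
    print("Python 3.10:", python_ok())
    vram = detect_vram_gb()
    if vram is None:
        print("Could not detect a CUDA GPU; check VRAM manually.")
    else:
        print(f"VRAM: {vram:.1f} GiB, suggested flags: {extra_flags(vram)}")
```

The suggested flags would be appended to whichever launch command supports them; consult the repository's README for the exact commands.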

Highlighted Details

Supports omni-modal inputs (text, vision, speech) and outputs (text, speech).
Enables simultaneous intermediate textual outputs during speech interactions.
Requires only a small amount of omni-modal data for training.
Built upon LLaVA and LLaVA-NeXT, incorporating CosyVoice for speech.
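The idea of emitting intermediate text during a speech interaction can be pictured as a stream of tagged events. The generator below is a hypothetical stand-in, not Stream-Omni's API: the event kinds, the `respond` function, and the fake "recognition" of each audio chunk as one word are assumptions chosen to keep the sketch runnable without any model.

```python
from typing import Iterator, List, Tuple

Event = Tuple[str, str]  # (kind, payload)


def respond(audio_chunks: List[str]) -> Iterator[Event]:
    """Hypothetical pipeline: yields running transcriptions, then the reply."""
    heard = []
    for chunk in audio_chunks:
        heard.append(chunk)  # pretend each chunk is recognised as one word
        yield ("asr", " ".join(heard))  # intermediate transcription so far
    reply = "You said: " + " ".join(heard)
    for word in reply.split():
        yield ("text", word)  # text response, streamed word by word
        yield ("speech", f"<audio:{word}>")  # placeholder for synthesised audio


events = list(respond(["what", "is", "this"]))
for kind, payload in events:
    print(f"{kind:6s} {payload}")
```

A client consuming such a stream could display the "asr" and "text" events as they arrive while playing the "speech" payloads, which is the behavior the list above describes.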

Maintenance & Community

Primary contact: zhangshaolei20z@ict.ac.cn.
Evaluation scripts available in ./scripts/stream_omni/.
The authors introduce the SpokenVisIT benchmark for evaluating vision-grounded speech interaction.

Licensing & Compatibility

The repository's license is not explicitly stated in the README.
Dependencies include CosyVoice, which may have its own licensing terms.

Limitations & Caveats

Because the README states no license for Stream-Omni, the terms for commercial use are unclear. The project also relies on external models such as CosyVoice, whose licensing and compatibility should be verified independently.

Health Check

Last Commit

8 months ago

Responsiveness

Inactive

Star History

6 stars in the last 30 days