Stream-Omni by ictnlp

GPT-4o-like multimodal chatbot

Created 3 months ago
343 stars

Top 80.6% on SourcePulse

Project Summary

Stream-Omni is an open-source multimodal chatbot designed for simultaneous interaction across text, vision, and speech modalities, mimicking GPT-4o's capabilities. It targets researchers and developers building advanced conversational AI systems, offering a unified framework for complex, multi-input/output interactions.

How It Works

Stream-Omni achieves multimodal alignment through sequence-dimension concatenation for vision-text and layer-dimension mapping for speech-text. This approach enables a seamless "see-while-hear" experience by simultaneously outputting intermediate textual results (like ASR transcriptions) during speech interactions, alongside the final text and speech responses.
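The two alignment terms can be pictured with a toy sketch. This is only one plausible reading of the summary, not Stream-Omni's actual code: token embeddings are plain Python lists, the "layers" are simple scaling functions, and every name here (`concat_along_sequence`, `run_layer_stack`, `scale_layer`) is hypothetical.

```python
from typing import Callable, List

Vector = List[float]
Sequence = List[Vector]  # one embedding vector per token


def concat_along_sequence(vision_tokens: Sequence, text_tokens: Sequence) -> Sequence:
    """Sequence-dimension concatenation: vision tokens are placed before
    text tokens, so the sequence gets longer while the width stays fixed."""
    widths = {len(tok) for tok in vision_tokens + text_tokens}
    if len(widths) > 1:
        raise ValueError("all tokens must share one embedding width")
    return vision_tokens + text_tokens


def run_layer_stack(tokens: Sequence,
                    layers: List[Callable[[Sequence], Sequence]]) -> Sequence:
    """Layer-dimension view: modules are stacked along model depth and
    applied in order; these toy layers keep the sequence length unchanged."""
    for layer in layers:
        tokens = layer(tokens)
    return tokens


def scale_layer(factor: float) -> Callable[[Sequence], Sequence]:
    """Toy stand-in for a neural layer (hypothetical): scales every value."""
    return lambda seq: [[factor * x for x in tok] for tok in seq]


vision = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 vision tokens, width 2
text = [[0.5, 0.5], [0.2, 0.8]]                # 2 text tokens, width 2
joint = concat_along_sequence(vision, text)    # 5 tokens, width 2

speech = [[1.0, 2.0], [3.0, 4.0]]              # 2 speech frames, width 2
# Assumed depth layout for illustration: a speech layer, then a model layer.
mapped = run_layer_stack(speech, [scale_layer(2.0), scale_layer(0.5)])
```

The contrast is the point of the sketch: concatenation changes the sequence length (3 + 2 = 5 tokens), while stacking layers changes model depth and leaves the number of speech positions as it was.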

Quick Start & Requirements

  • Installation: Requires Python 3.10, flash-attn, and other dependencies listed in requirements.txt. A Conda environment is recommended.
  • Models: Download Stream-Omni checkpoints and CosyVoice models.
  • Launch: Run controller, worker, and Gradio interface scripts.
  • Hardware: Use the --load-8bit option when GPU VRAM is below 32GB.
  • Demo: Vision-grounded Speech Interaction Demo
  • API: api.py
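The hardware note above can be turned into a small decision helper. This is a minimal sketch, assuming the 32GB threshold from the list; the function name `extra_launch_flags` is hypothetical, and the summary does not say which launch script accepts `--load-8bit`, so the flag is returned on its own rather than attached to a specific command.

```python
from typing import List

# Threshold taken from the summary: 8-bit loading is suggested below 32 GB of VRAM.
LOAD_8BIT_VRAM_THRESHOLD_GB = 32.0


def extra_launch_flags(vram_gb: float) -> List[str]:
    """Return optional flags to append to a launch command (hypothetical helper)."""
    if vram_gb <= 0:
        raise ValueError("vram_gb must be positive")
    if vram_gb < LOAD_8BIT_VRAM_THRESHOLD_GB:
        return ["--load-8bit"]
    return []


print(extra_launch_flags(24.0))  # ['--load-8bit']
print(extra_launch_flags(48.0))  # []
```

With PyTorch installed, the available memory could be read from `torch.cuda.get_device_properties(0).total_memory` (a value in bytes) and converted to gigabytes before calling the helper.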

Highlighted Details

  • Supports omni-modal inputs (text, vision, speech) and outputs (text, speech).
  • Enables simultaneous intermediate textual outputs during speech interactions.
  • Requires only a small amount of omni-modal data for training.
  • Built upon LLaVA and LLaVA-NeXT, incorporating CosyVoice for speech.

Maintenance & Community

  • Primary contact: zhangshaolei20z@ict.ac.cn.
  • Evaluation scripts available in ./scripts/stream_omni/.
  • The project introduces the SpokenVisIT benchmark for evaluating vision-grounded speech interaction.

Licensing & Compatibility

  • The repository's license is not explicitly stated in the README.
  • Dependencies include CosyVoice, which may have its own licensing terms.

Limitations & Caveats

The README does not state a license for Stream-Omni itself, so the terms for commercial use are unclear. The project also relies on external models such as CosyVoice, whose licensing and compatibility should be verified independently.

Health Check

  • Last commit: 3 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 9 stars in the last 30 days
