Stream-Omni by ictnlp

GPT-4o-like multimodal chatbot

Created 11 months ago
386 stars

Top 73.9% on SourcePulse

Project Summary

Stream-Omni is an open-source multimodal chatbot designed for simultaneous interaction across text, vision, and speech modalities, mimicking GPT-4o's capabilities. It targets researchers and developers building advanced conversational AI systems, offering a unified framework for complex, multi-input/output interactions.

How It Works

Stream-Omni achieves multimodal alignment through sequence-dimension concatenation for vision-text and layer-dimension mapping for speech-text. This approach enables a seamless "see-while-hear" experience by simultaneously outputting intermediate textual results (like ASR transcriptions) during speech interactions, alongside the final text and speech responses.
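The two alignment styles named above can be contrasted with a toy sketch. This is an illustration only, not the project's code: the function names, the list-of-vectors representation, and the scaling "layers" are all hypothetical stand-ins. It assumes that sequence-dimension alignment means joining vision and text embeddings into one longer sequence, and that layer-dimension alignment means placing speech-specific layers in the depth of the model so the sequence length is unchanged.

```python
from typing import Callable, List

Vector = List[float]
Sequence = List[Vector]
Layer = Callable[[Sequence], Sequence]


def concat_along_sequence(vision_tokens: Sequence, text_tokens: Sequence) -> Sequence:
    """Sequence-dimension alignment (assumed): one longer sequence,
    vision embeddings first, then text embeddings."""
    return vision_tokens + text_tokens


def stack_along_layers(speech_layers: List[Layer], llm_layers: List[Layer]) -> List[Layer]:
    """Layer-dimension alignment (assumed): speech-specific layers are
    added in depth, ahead of the language-model layers."""
    return speech_layers + llm_layers


def run_stack(layers: List[Layer], sequence: Sequence) -> Sequence:
    """Apply each layer in order to the whole sequence."""
    for layer in layers:
        sequence = layer(sequence)
    return sequence


def scale(factor: float) -> Layer:
    """Hypothetical toy layer: multiplies every element by `factor`."""
    def layer(sequence: Sequence) -> Sequence:
        return [[x * factor for x in vec] for vec in sequence]
    return layer


# Vision-text: the sequence grows from 2 + 1 tokens to 3 tokens.
vision = [[1.0, 0.0], [0.0, 1.0]]
text = [[0.5, 0.5]]
fused = concat_along_sequence(vision, text)

# Speech-text: the model grows from 2 layers to 3; the sequence stays at 2 tokens.
speech = [[1.0, 2.0], [3.0, 4.0]]
stack = stack_along_layers([scale(0.5)], [scale(2.0), scale(2.0)])
speech_out = run_stack(stack, speech)
```

In this sketch the vision path changes the length of the input, while the speech path changes the depth of the model and leaves the number of positions untouched. How the real model maps speech positions onto text positions is not described in this summary.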

Quick Start & Requirements

  • Installation: Requires Python 3.10, flash-attn, and other dependencies listed in requirements.txt. A Conda environment is recommended.
  • Models: Download Stream-Omni checkpoints and CosyVoice models.
  • Launch: Run controller, worker, and Gradio interface scripts.
  • Hardware: Pass the --load-8bit option when the GPU has less than 32 GB of VRAM.
  • Demo: Vision-grounded Speech Interaction Demo
  • API: api.py
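The hardware rule above can be expressed as a small helper that decides which extra flag to add to a launch command. This is a minimal sketch: the function name and the idea of returning a flag list are hypothetical, and the summary does not say which launch script accepts --load-8bit, so that choice is left to the caller.

```python
from typing import List

# Threshold taken from the hardware note: below 32 GB of VRAM, use 8-bit loading.
VRAM_THRESHOLD_GB = 32.0


def quantization_flags(vram_gb: float) -> List[str]:
    """Return extra command-line flags for a GPU with `vram_gb` of VRAM.

    Hypothetical helper for illustration; it is not part of the repository.
    """
    if vram_gb <= 0:
        raise ValueError("vram_gb must be positive")
    return ["--load-8bit"] if vram_gb < VRAM_THRESHOLD_GB else []
```

For example, quantization_flags(24) yields the 8-bit flag, while a card with exactly 32 GB or more yields an empty list, since the note only covers VRAM strictly below 32 GB.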

Highlighted Details

  • Supports omni-modal inputs (text, vision, speech) and outputs (text, speech).
  • Enables simultaneous intermediate textual outputs during speech interactions.
  • Requires only a small amount of omni-modal data for training.
  • Built upon LLaVA and LLaVA-NeXT, incorporating CosyVoice for speech.

Maintenance & Community

  • Primary contact: zhangshaolei20z@ict.ac.cn.
  • Evaluation scripts available in ./scripts/stream_omni/.
  • The authors constructed the SpokenVisIT benchmark for evaluating vision-grounded speech interaction.

Licensing & Compatibility

  • The repository's license is not explicitly stated in the README.
  • Dependencies include CosyVoice, which may have its own licensing terms.

Limitations & Caveats

Because the README does not state a license for Stream-Omni itself, its suitability for commercial use is unclear. The project also relies on external models such as CosyVoice, whose compatibility and licensing must be verified independently.

Health Check

  • Last commit: 11 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 2 stars in the last 30 days
