Multimodal LLM for real-time vision and speech interaction
VITA-1.5 is an open-source interactive omni-multimodal LLM designed for real-time vision and speech interaction, targeting researchers and developers in multimodal AI. It significantly reduces interaction latency relative to its predecessor and strengthens multimodal performance, aiming for GPT-4o-level capabilities.
How It Works
VITA-1.5 builds upon VITA-1.0 by incorporating advancements in speech processing and multimodal integration. It features a reduced end-to-end speech interaction latency (down to 1.5 seconds from 4 seconds) and improved ASR Word Error Rate (from 18.4% to 7.5%). A key innovation is the replacement of VITA-1.0's independent TTS module with an end-to-end module that accepts LLM embeddings. A progressive training strategy ensures that adding audio modality has minimal impact on vision-language performance.
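The end-to-end speech path can be pictured as a single forward pass in which the speech decoder conditions on the LLM's hidden states instead of on re-synthesized text. The following is a minimal sketch of that idea; the module names, dimensions, and layer choices (OmniSpeechPipeline, AudioEncoder-style linear stubs, a toy Transformer) are illustrative assumptions, not the project's actual API.

```python
# Minimal sketch of an end-to-end omni-modal pipeline in the spirit of VITA-1.5.
# Module names and dimensions are illustrative assumptions, not the real VITA-1.5 code.
import torch
import torch.nn as nn

class OmniSpeechPipeline(nn.Module):
    def __init__(self, d_model=1024, codec_tokens=1024):
        super().__init__()
        self.vision_encoder = nn.Linear(768, d_model)   # stand-in for a ViT-style vision encoder
        self.audio_encoder = nn.Linear(80, d_model)     # stand-in for a speech (mel-spectrogram) encoder
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2
        )
        # End-to-end speech decoder: consumes LLM embeddings directly,
        # replacing the independent TTS module used in VITA-1.0.
        self.speech_decoder = nn.Linear(d_model, codec_tokens)

    def forward(self, image_feats, audio_feats):
        tokens = torch.cat(
            [self.vision_encoder(image_feats), self.audio_encoder(audio_feats)], dim=1
        )
        hidden = self.llm(tokens)                    # shared multimodal hidden states
        return self.speech_decoder(hidden)           # speech-token logits, no text-to-TTS hop

# Example: one image patch sequence plus a short audio window.
pipe = OmniSpeechPipeline()
out = pipe(torch.randn(1, 16, 768), torch.randn(1, 50, 80))
print(out.shape)  # torch.Size([1, 66, 1024])
```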
Quick Start & Requirements
Create a conda environment (conda create -n vita python=3.10), activate it, and install the requirements (pip install -r requirements.txt, followed by pip install flash-attn --no-build-isolation).
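After installation, a quick sanity check can confirm that PyTorch sees a GPU and that flash-attn imported correctly. This is a generic check, not a script from the VITA repository.

```python
# Generic environment sanity check after installing the requirements.
import torch

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn not installed; re-run: pip install flash-attn --no-build-isolation")
```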
Highlighted Details
Maintenance & Community
The project has released a technical report for VITA-1.5 and supports evaluation via VLMEvalKit. Related works and acknowledgments include LLaVA-1.5, InternViT, and Qwen-2.5.
Licensing & Compatibility
The repository does not explicitly state a license in the README.
Limitations & Caveats
The real-time interactive demo requires manual configuration of the vLLM and VAD modules. The model is trained on open-source corpora, and generated content is subject to randomness and does not represent the developers' views.