VITA by VITA-MLLM

Multimodal LLM for real-time vision and speech interaction

Created 1 year ago
2,511 stars

Top 17.9% on SourcePulse

Project Summary

VITA-1.5 is an open-source interactive omni-multimodal LLM designed for real-time vision and speech interaction, targeting researchers and developers in multimodal AI. Compared with VITA-1.0, it cuts end-to-end speech interaction latency from 4 seconds to 1.5 seconds and raises average multimodal benchmark performance from 59.8 to 70.8, with the stated aim of reaching GPT-4o-level capabilities.

How It Works

VITA-1.5 builds upon VITA-1.0 with advances in speech processing and multimodal integration. End-to-end speech interaction latency drops from 4 seconds to 1.5 seconds, and the ASR word error rate (WER) improves from 18.4% to 7.5%. A key change is the replacement of VITA-1.0's independent TTS module with an end-to-end module that accepts LLM embeddings as input. A progressive training strategy ensures that adding the audio modality has minimal impact on vision-language performance.

Quick Start & Requirements

  • Install: Clone the repository, create a conda environment (conda create -n vita python=3.10), activate it (conda activate vita), and install the requirements (pip install -r requirements.txt, then pip install flash-attn --no-build-isolation).
  • Prerequisites: Python 3.10, CUDA, flash-attn.
  • Demo: A basic demo is available on ModelScope. The real-time interactive demo requires additional setup, including VAD (voice activity detection) modules and modifications to vLLM.
  • Links: VITA-1.5 Paper, Basic Demo, Real-Time Demo
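
The prerequisites above can be checked before running the demos. The following is a minimal sketch, not part of the VITA repository: the function name check_prerequisites is hypothetical, and it assumes that PyTorch is the framework used to reach CUDA and that flash-attn is importable as flash_attn.

```python
import importlib.util
import sys

REQUIRED_PYTHON = (3, 10)  # matches "conda create -n vita python=3.10"


def check_prerequisites(version_info=sys.version_info,
                        find_spec=importlib.util.find_spec):
    """Return {check_name: bool} for the prerequisites listed above.

    Hypothetical helper for illustration; it is not shipped with VITA.
    """
    # Assumption: CUDA availability is probed through PyTorch.
    has_torch = find_spec("torch") is not None
    cuda_ok = False
    if has_torch:
        import torch  # imported lazily so the check runs without PyTorch
        cuda_ok = torch.cuda.is_available()
    return {
        "python_3_10": tuple(version_info[:2]) == REQUIRED_PYTHON,
        "cuda_available": cuda_ok,
        # flash-attn installs a top-level module named flash_attn.
        "flash_attn_installed": find_spec("flash_attn") is not None,
    }
```

Calling check_prerequisites() inside the activated vita environment returns a dictionary of booleans; any False entry points to a prerequisite that still needs attention.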

Highlighted Details

  • Achieves 1.5-second end-to-end speech interaction latency.
  • Improves average multimodal benchmark performance from 59.8 to 70.8.
  • Reduces ASR WER from 18.4% to 7.5%.
  • Integrates an end-to-end TTS module.
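
To put the figures above on a common scale, the short sketch below computes the relative change for each reported before/after pair. The numbers come from this summary; the helper name relative_change is an illustrative choice, not something from the project.

```python
def relative_change(before, after):
    """Relative change from before to after, as a percentage of before."""
    return (after - before) / before * 100.0


# (VITA-1.0 value, VITA-1.5 value) as reported in this summary
metrics = {
    "speech latency (s)": (4.0, 1.5),
    "ASR WER (%)": (18.4, 7.5),
    "avg multimodal benchmark": (59.8, 70.8),
}

for name, (before, after) in metrics.items():
    change = relative_change(before, after)
    print(f"{name}: {before} -> {after} ({change:+.1f}%)")
```

On these inputs, latency falls by 62.5%, WER falls by roughly 59.2% in relative terms, and the average benchmark score rises by 11.0 points, or roughly 18.4% in relative terms.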

Maintenance & Community

The project has released a technical report for VITA-1.5 and supports evaluation via VLMEvalKit. Related works and acknowledgments include LLaVA-1.5, InternViT, and Qwen-2.5.

Licensing & Compatibility

The repository does not explicitly state a license in the README.

Limitations & Caveats

The real-time interactive demo requires manual configuration of vLLM and the VAD modules. The model is trained on open-source corpora; its generated content is subject to randomness and does not represent the developers' views.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 7 stars in the last 30 days
