VITA by VITA-MLLM

Multimodal LLM for real-time vision and speech interaction

created 11 months ago
2,365 stars

Top 19.8% on sourcepulse

View on GitHub
1 Expert Loves This Project
Project Summary

VITA-1.5 is an open-source interactive omni-multimodal LLM designed for real-time vision and speech interaction, targeting researchers and developers in multimodal AI. It significantly reduces interaction latency and enhances multimodal performance, aiming for GPT-4o level capabilities.

How It Works

VITA-1.5 builds on VITA-1.0 with advances in speech processing and multimodal integration. It cuts end-to-end speech interaction latency from 4 seconds to 1.5 seconds and improves ASR word error rate from 18.4% to 7.5%. A key change is replacing VITA-1.0's independent TTS module with an end-to-end TTS module that accepts LLM embeddings directly. A progressive training strategy ensures that adding the audio modality has minimal impact on vision-language performance.
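
A minimal sketch of that design change, using hypothetical names and toy data rather than VITA's actual API: the cascaded path cannot start synthesis until the response text is fully decoded, while the end-to-end module is conditioned directly on the LLM's output embeddings.

    # Hypothetical sketch; class and function names are illustrative, not VITA's real API.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class LLMOutput:
        text: str                   # decoded response text
        hidden_states: List[float]  # stand-in for the LLM's output embeddings

    def run_llm(user_audio_tokens: List[int]) -> LLMOutput:
        # Placeholder for audio encoding plus LLM decoding.
        return LLMOutput(text="hello", hidden_states=[0.1, 0.2, 0.3])

    def cascaded_tts(text: str) -> List[float]:
        # VITA-1.0 style: an independent TTS module that only sees the final text.
        return [ord(c) / 255.0 for c in text]

    def end_to_end_tts(hidden_states: List[float]) -> List[float]:
        # VITA-1.5 style: the TTS module is conditioned on LLM embeddings
        # rather than on decoded text.
        return [2.0 * h for h in hidden_states]

    out = run_llm([1, 2, 3])
    audio_v10 = cascaded_tts(out.text)             # text-mediated path
    audio_v15 = end_to_end_tts(out.hidden_states)  # embedding-conditioned path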

Quick Start & Requirements

  • Install: Clone the repository, create a conda environment (conda create -n vita python=3.10), activate it, and install the requirements (pip install -r requirements.txt, then pip install flash-attn --no-build-isolation); the full command sequence is consolidated after this list.
  • Prerequisites: Python 3.10, CUDA, flash-attn.
  • Demo: Basic demo available on ModelScope. Real-time interactive demo requires additional setup including VAD modules and modifications to vLLM.
  • Links: VITA-1.5 Paper, Basic Demo, Real-Time Demo
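
The install steps from the first bullet, gathered into one sequence; the repository URL and the conda activate line are inferred rather than quoted from this summary:

    git clone https://github.com/VITA-MLLM/VITA.git   # inferred repository URL
    cd VITA
    conda create -n vita python=3.10
    conda activate vita
    pip install -r requirements.txt
    pip install flash-attn --no-build-isolation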

Highlighted Details

  • Achieves 1.5-second end-to-end speech interaction latency.
  • Improves average multimodal benchmark performance from 59.8 to 70.8.
  • Reduces ASR WER from 18.4% to 7.5%.
  • Integrates an end-to-end TTS module.

Maintenance & Community

The project has released a technical report for VITA-1.5 and supports evaluation via VLMEvalKit. Related works and acknowledgments include LLaVA-1.5, InternViT, and Qwen-2.5.

Licensing & Compatibility

The repository does not explicitly state a license in the README.

Limitations & Caveats

The real-time interactive demo requires manual configuration of vLLM and the VAD modules. The models are trained on open-source corpora; generated content is subject to randomness and does not represent the developers' views.

Health Check

  • Last commit: 4 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 2
  • Star History: 121 stars in the last 90 days

Explore Similar Projects

Starred by Jeff Hammerbacher (Cofounder of Cloudera), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 2 more.

MiniCPM-o by OpenBMB

0.2%
20k
MLLM for vision, speech, and multimodal live streaming on your phone
created 1 year ago
updated 1 month ago
Starred by Georgios Konstantopoulos (CTO, General Partner at Paradigm) and Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems).

GPT-SoVITS by RVC-Boss

0.6%
49k
Few-shot voice cloning and TTS web UI
created 1 year ago
updated 2 weeks ago