Multimodal LLM for real-time vision and speech interaction
VITA-1.5 is an open-source interactive omni-multimodal LLM designed for real-time vision and speech interaction, targeting researchers and developers in multimodal AI. It significantly reduces interaction latency relative to its predecessor and strengthens multimodal performance, aiming for GPT-4o-level capabilities.
How It Works
VITA-1.5 builds upon VITA-1.0 by incorporating advancements in speech processing and multimodal integration. It features a reduced end-to-end speech interaction latency (down to 1.5 seconds from 4 seconds) and improved ASR Word Error Rate (from 18.4% to 7.5%). A key innovation is the replacement of VITA-1.0's independent TTS module with an end-to-end module that accepts LLM embeddings. A progressive training strategy ensures that adding audio modality has minimal impact on vision-language performance.
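The end-to-end speech path can be pictured as a single forward pass in which the speech decoder conditions on the LLM's hidden states instead of on re-synthesized text. The following is a minimal sketch of that idea; the module names, dimensions, and layer choices (OmniSpeechPipeline, AudioEncoder-style linear stubs, a toy Transformer) are illustrative assumptions, not the project's actual API.

```python
# Minimal sketch of an end-to-end omni-modal pipeline in the spirit of VITA-1.5.
# Module names and dimensions are illustrative assumptions, not the real VITA-1.5 code.
import torch
import torch.nn as nn

class OmniSpeechPipeline(nn.Module):
    def __init__(self, d_model=1024, codec_tokens=1024):
        super().__init__()
        self.vision_encoder = nn.Linear(768, d_model)   # stand-in for a ViT-style vision encoder
        self.audio_encoder = nn.Linear(80, d_model)     # stand-in for a speech (mel-spectrogram) encoder
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2
        )
        # End-to-end speech decoder: consumes LLM embeddings directly,
        # replacing the independent TTS module used in VITA-1.0.
        self.speech_decoder = nn.Linear(d_model, codec_tokens)

    def forward(self, image_feats, audio_feats):
        tokens = torch.cat(
            [self.vision_encoder(image_feats), self.audio_encoder(audio_feats)], dim=1
        )
        hidden = self.llm(tokens)                    # shared multimodal hidden states
        return self.speech_decoder(hidden)           # speech-token logits, no text-to-TTS hop

# Example: one image patch sequence plus a short audio window.
pipe = OmniSpeechPipeline()
out = pipe(torch.randn(1, 16, 768), torch.randn(1, 50, 80))
print(out.shape)  # torch.Size([1, 66, 1024])
```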
Quick Start & Requirements
Create a conda environment (conda create -n vita python=3.10), activate it, and install the requirements (pip install -r requirements.txt, followed by pip install flash-attn --no-build-isolation).
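After installation, a quick sanity check can confirm that PyTorch sees a GPU and that flash-attn imported correctly. This is a generic check, not a script from the VITA repository.

```python
# Generic environment sanity check after installing the requirements.
import torch

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn not installed; re-run: pip install flash-attn --no-build-isolation")
```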
Highlighted Details
Maintenance & Community
The project has released a technical report for VITA-1.5 and supports evaluation via VLMEvalKit. Related works and acknowledgments include LLaVA-1.5, InternViT, and Qwen-2.5.
Licensing & Compatibility
The repository does not explicitly state a license in the README.
Limitations & Caveats
The real-time interactive demo requires manual configuration of the vLLM and VAD modules. The model is trained on open-source corpora, and generated content is subject to randomness and does not represent the developers' views.