VITA-Audio by VITA-MLLM

Speech model for fast audio-text token generation

created 3 months ago
623 stars

Top 53.8% on sourcepulse

Project Summary

VITA-Audio is an end-to-end large speech model designed for fast audio-text token generation. It targets researchers and developers working with speech processing, offering a 3-5x inference speedup at the 7B parameter scale and low first-chunk latency for tasks like Automatic Speech Recognition (ASR), Text-to-Speech (TTS), and Spoken Question Answering (SQA).

How It Works

VITA-Audio employs interleaved cross-modal token generation, an approach that allows audio to be generated during the initial forward pass. By using a set of prefill tokens, it sharply reduces the latency for generating the first audio chunk, and it achieves a 3-5x inference speedup on a 7B parameter model. The model is trained on open-source data comprising 200k hours of audio.
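To make the interleaving idea concrete, the toy sketch below orders text and audio tokens in alternating chunks and reports where the first audio token lands. It is not the project's code: the function names, the chunk sizes (n_text, n_audio), and the token ids are all assumptions chosen for illustration, and the real model's schedule may differ.

```python
from typing import List, Sequence, Tuple

Token = Tuple[str, int]  # (modality, token id); illustrative representation only


def interleave(text_ids: Sequence[int], audio_ids: Sequence[int],
               n_text: int = 1, n_audio: int = 4) -> List[Token]:
    """Alternate chunks of n_text text tokens and n_audio audio tokens.

    The chunk sizes are hypothetical; they are not taken from VITA-Audio.
    """
    if n_text < 1 or n_audio < 1:
        raise ValueError("chunk sizes must be positive")
    out: List[Token] = []
    t = a = 0
    while t < len(text_ids) or a < len(audio_ids):
        out.extend(("text", tok) for tok in text_ids[t:t + n_text])
        t += n_text
        out.extend(("audio", tok) for tok in audio_ids[a:a + n_audio])
        a += n_audio
    return out


def first_audio_position(seq: Sequence[Token]) -> int:
    """Index of the first audio token in the sequence, or -1 if there is none."""
    for i, (modality, _) in enumerate(seq):
        if modality == "audio":
            return i
    return -1


if __name__ == "__main__":
    text = [1, 2, 3]
    audio = [10, 11, 12, 13, 14, 15]
    mixed = interleave(text, audio, n_text=1, n_audio=2)
    # A text-first ordering, used here only as a contrast case.
    text_first = [("text", t) for t in text] + [("audio", a) for a in audio]
    print(first_audio_position(mixed), first_audio_position(text_first))
```

In this toy ordering the first audio token appears right after the first text chunk, whereas in the text-first contrast case it appears only after every text token, which is the intuition behind a lower first-chunk latency.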

Quick Start & Requirements

  • Installation: Clone the repository, update its submodules, then install dependencies with pip install -r requirements_ds_gpu.txt followed by pip install -e . (an editable install of the current directory).
  • Prerequisites: Requires a GPU environment. Pre-trained weights for a Large Language Model (e.g., Qwen2.5-7B-Instruct), an Audio Encoder (glm-4-voice-tokenizer), and an Audio Decoder (glm-4-voice-decoder) must be downloaded and placed in specified directories.
  • Setup: Docker image shenyunhang/pytorch:24.11-py3_2024-1224 is available.
  • Documentation: VITA-Audio Paper
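Before running inference, it can help to confirm that the three pre-trained components listed above are in place. The sketch below is a minimal check under stated assumptions: the helper name missing_components and the idea that each component sits in a sub-directory named after it under one weights root are hypothetical, since this summary does not give the actual directory layout.

```python
from pathlib import Path
from typing import Iterable, List

# Component names come from the prerequisites list; the one-folder-per-component
# layout under a single root is an assumption made for this sketch.
REQUIRED_COMPONENTS = (
    "Qwen2.5-7B-Instruct",
    "glm-4-voice-tokenizer",
    "glm-4-voice-decoder",
)


def missing_components(weights_root: str,
                       names: Iterable[str] = REQUIRED_COMPONENTS) -> List[str]:
    """Return the component names that have no directory under weights_root."""
    root = Path(weights_root)
    return [name for name in names if not (root / name).is_dir()]


if __name__ == "__main__":
    # "./models" is a placeholder path, not the location the repository expects.
    missing = missing_components("./models")
    if missing:
        print("Missing components:", ", ".join(missing))
    else:
        print("All components found.")
```

Adjust the root path and folder names to whatever layout the repository's own instructions specify.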

Highlighted Details

  • Cuts the latency for generating the first audio token chunk from 236 ms to 53 ms.
  • Offers 3-5x inference speedup at the 7B parameter scale.
  • Trained on 200k hours of open-source audio data.
  • Demonstrates competitive performance on ASR, TTS, and SQA benchmarks.
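The first-chunk latency figures can be turned into a ratio and a percentage reduction with a few lines of arithmetic. The helper names below are illustrative. The resulting ratio describes first-chunk latency only and is a separate measurement from the 3-5x overall inference speedup.

```python
def latency_ratio(baseline_ms: float, improved_ms: float) -> float:
    """How many times lower the improved latency is than the baseline."""
    return baseline_ms / improved_ms


def latency_reduction_pct(baseline_ms: float, improved_ms: float) -> float:
    """Percentage by which latency dropped relative to the baseline."""
    return (baseline_ms - improved_ms) / baseline_ms * 100.0


if __name__ == "__main__":
    baseline, improved = 236.0, 53.0  # milliseconds, from the highlights above
    print(f"{latency_ratio(baseline, improved):.2f}x lower")          # about 4.45x
    print(f"{latency_reduction_pct(baseline, improved):.1f}% lower")  # about 77.5%
```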

Maintenance & Community

  • The project is actively being developed, with plans to release cleaned data.
  • Model weights are available on Hugging Face (e.g., VITA-Audio-Boost, VITA-Audio-Balance).
  • Contact information includes a WeChat group.

Licensing & Compatibility

  • The README does not explicitly state a license.
  • Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project is still under active development, and some components, such as the cleaned open-source data, have yet to be released. VITA-Audio's output is stochastic, so results can vary between runs, and the developers disclaim responsibility for any issues arising from its use.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 1

Star History

  • 624 stars in the last 90 days
