VITA-Audio by VITA-MLLM

Speech model for fast audio-text token generation

Created 4 months ago
634 stars

Top 52.3% on SourcePulse

Project Summary

VITA-Audio is an end-to-end large speech model designed for fast audio-text token generation. It targets researchers and developers working on speech processing, offering faster inference and lower first-chunk latency for tasks such as Automatic Speech Recognition (ASR), Text-to-Speech (TTS), and Spoken Question Answering (SQA).

How It Works

VITA-Audio employs interleaved cross-modal token generation, an approach that lets the model generate audio during the initial forward pass instead of waiting for the text output to finish. By using a set of prefill tokens, it sharply reduces the latency of generating the first audio chunk and achieves a 3-5x inference speedup at the 7B parameter scale. The model is trained on open-source data comprising 200k hours of audio.
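The idea of interleaving the two modalities in a single output stream can be illustrated with a toy sketch. This is not VITA-Audio's implementation: the function name interleave and the chunk sizes text_chunk and audio_chunk are assumptions chosen for illustration, and the actual ratios and model internals are described in the paper. The sketch only shows why audio tokens can appear near the start of the stream rather than after all text tokens.

```python
from typing import Iterator, List, Sequence, Tuple


def interleave(
    text_tokens: Sequence,
    audio_tokens: Sequence,
    text_chunk: int = 1,   # hypothetical chunk size, not taken from VITA-Audio
    audio_chunk: int = 4,  # hypothetical chunk size, not taken from VITA-Audio
) -> Iterator[Tuple[str, object]]:
    """Yield (modality, token) pairs, alternating chunks of text and audio."""
    t = a = 0
    while t < len(text_tokens) or a < len(audio_tokens):
        # Emit the next chunk of text tokens (may be empty once text runs out).
        for tok in text_tokens[t:t + text_chunk]:
            yield ("text", tok)
        t += text_chunk
        # Emit the next chunk of audio tokens right after it.
        for tok in audio_tokens[a:a + audio_chunk]:
            yield ("audio", tok)
        a += audio_chunk


def first_audio_index(stream: List[Tuple[str, object]]) -> int:
    """Position of the first audio token in the stream, or -1 if none."""
    for i, (modality, _) in enumerate(stream):
        if modality == "audio":
            return i
    return -1


if __name__ == "__main__":
    stream = list(interleave(["a", "b"], [1, 2, 3, 4, 5], text_chunk=1, audio_chunk=2))
    print(stream)
    print(first_audio_index(stream))
```

In this toy stream the first audio token appears directly after the first text chunk, whereas a text-then-audio layout would place it after every text token.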

Quick Start & Requirements

  • Installation: Clone the repository, update its submodules, then run pip install -r requirements_ds_gpu.txt and pip install -e . to install the dependencies and the package.
  • Prerequisites: Requires a GPU environment. Pre-trained weights for a Large Language Model (e.g., Qwen2.5-7B-Instruct), an Audio Encoder (glm-4-voice-tokenizer), and an Audio Decoder (glm-4-voice-decoder) must be downloaded and placed in specified directories.
  • Setup: Docker image shenyunhang/pytorch:24.11-py3_2024-1224 is available.
  • Documentation: VITA-Audio Paper
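Because three sets of pre-trained weights must be in place before inference, a small pre-flight check can catch a missing download early. The sketch below is an assumption-laden helper, not part of the repository: the function missing_weights and the example paths are hypothetical, and the real target directories are the ones specified in the project's README.

```python
from pathlib import Path
from typing import Dict, List


def missing_weights(weight_dirs: Dict[str, Path]) -> List[str]:
    """Return the names of components whose directory is absent or empty."""
    missing = []
    for name, path in weight_dirs.items():
        p = Path(path)
        # A directory that does not exist, or exists but holds no files,
        # is treated as "weights not downloaded yet".
        if not p.is_dir() or not any(p.iterdir()):
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Hypothetical locations for illustration only; use the directories
    # given in the project's README.
    dirs = {
        "llm": Path("models/Qwen2.5-7B-Instruct"),
        "audio_encoder": Path("models/glm-4-voice-tokenizer"),
        "audio_decoder": Path("models/glm-4-voice-decoder"),
    }
    print("Missing components:", missing_weights(dirs))
```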

Highlighted Details

  • Reduces the latency of generating the first audio token chunk from 236 ms to 53 ms.
  • Offers 3-5x inference speedup at the 7B parameter scale.
  • Trained on 200k hours of open-source audio data.
  • Demonstrates competitive performance on ASR, TTS, and SQA benchmarks.
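As a quick sanity check on the latency figures above, the ratio between the two reported first-chunk latencies can be computed directly. This is plain arithmetic on the numbers quoted in this summary; whether the first-chunk latency and the 3-5x inference speedup are measured under the same conditions is not stated here, so the two should not be assumed to be the same metric. The helper name latency_speedup is hypothetical.

```python
def latency_speedup(baseline_ms: float, optimized_ms: float) -> float:
    """Ratio of baseline latency to optimized latency."""
    return baseline_ms / optimized_ms


def latency_reduction_pct(baseline_ms: float, optimized_ms: float) -> float:
    """Percentage by which latency dropped relative to the baseline."""
    return 100.0 * (baseline_ms - optimized_ms) / baseline_ms


if __name__ == "__main__":
    # Figures quoted in the summary: 236 ms before, 53 ms after.
    print(f"{latency_speedup(236, 53):.2f}x")          # prints 4.45x
    print(f"{latency_reduction_pct(236, 53):.1f}%")    # prints 77.5%
```

The resulting ratio of roughly 4.45x falls inside the quoted 3-5x range.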

Maintenance & Community

  • The project is under active development, with plans to release the cleaned training data.
  • Model weights are available on Hugging Face (e.g., VITA-Audio-Boost, VITA-Audio-Balance).
  • A WeChat group is listed for contact.

Licensing & Compatibility

  • The README does not explicitly state a license.
  • Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project is still under active development, and some components, such as the cleaned open-source data, have yet to be released. VITA-Audio's output is non-deterministic, and the developers disclaim responsibility for any issues arising from its use.

Health Check

  • Last commit: 3 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 3
  • Star history: 5 stars in the last 30 days
