VITA-Audio by VITA-MLLM

Speech model for fast audio-text token generation

Created 10 months ago

673 stars

Top 50.1% on SourcePulse

Project Summary

VITA-Audio is an end-to-end large speech model designed for efficient and fast audio-text token generation. It targets researchers and developers working with speech processing, offering significant inference speedups and low latency for tasks like Automatic Speech Recognition (ASR), Text-to-Speech (TTS), and Spoken Question Answering (SQA).

How It Works

VITA-Audio employs interleaved cross-modal token generation, which lets the model produce audio tokens during the initial forward pass. By using a set of prefill tokens, it sharply reduces the latency of the first audio chunk and achieves a 3-5x inference speedup at the 7B parameter scale. The model is trained on 200k hours of open-source audio data.
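To illustrate why interleaving helps first-chunk latency, here is a toy sketch that compares where the first audio token lands in a text-then-audio sequence versus an interleaved one. The function names, the token labels, and the 1:4 text-to-audio ratio are assumptions made for illustration; they are not taken from the VITA-Audio code or paper.

```python
from typing import List


def interleave(text_tokens: List[str], audio_tokens: List[str],
               n_text: int = 1, n_audio: int = 4) -> List[str]:
    """Alternate chunks of n_text text tokens with chunks of n_audio audio tokens.

    The 1:4 default ratio is a hypothetical choice for this sketch.
    """
    out: List[str] = []
    t = a = 0
    while t < len(text_tokens) or a < len(audio_tokens):
        out.extend(text_tokens[t:t + n_text])
        t += n_text
        out.extend(audio_tokens[a:a + n_audio])
        a += n_audio
    return out


def first_audio_position(sequence: List[str]) -> int:
    """Index of the first audio token (audio tokens carry an 'A' prefix here)."""
    return next(i for i, tok in enumerate(sequence) if tok.startswith("A"))


text = [f"T{i}" for i in range(8)]
audio = [f"A{i}" for i in range(32)]

sequential = text + audio              # all text first, then all audio
interleaved = interleave(text, audio)  # T0 A0 A1 A2 A3 T1 A4 ...

print(first_audio_position(sequential))   # 8
print(first_audio_position(interleaved))  # 1
```

In this toy setup the first audio token appears after one text token instead of after the full text response, which is the intuition behind the lower first-chunk latency. How the real model schedules and predicts these tokens is described in the paper.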

Quick Start & Requirements

Installation: Clone the repository, update submodules, and install the requirements with pip install -r requirements_ds_gpu.txt, followed by pip install -e . (note the trailing dot, which refers to the current directory).
Prerequisites: Requires a GPU environment. Pre-trained weights for a Large Language Model (e.g., Qwen2.5-7B-Instruct), an Audio Encoder (glm-4-voice-tokenizer), and an Audio Decoder (glm-4-voice-decoder) must be downloaded and placed in specified directories.
Setup: Docker image shenyunhang/pytorch:24.11-py3_2024-1224 is available.
Documentation: VITA-Audio Paper
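Because three separate sets of weights must be in place before inference, a quick pre-flight check can save a failed run. The sketch below reports which components are missing. The subdirectory names are hypothetical placeholders, since this summary does not give the paths the repository expects; substitute the directories named in the README.

```python
from pathlib import Path
from typing import Dict, List

# Hypothetical layout: these subdirectory names are placeholders, not the
# paths the repository actually expects.
EXPECTED_WEIGHTS: Dict[str, str] = {
    "LLM (e.g., Qwen2.5-7B-Instruct)": "models/llm",
    "Audio encoder (glm-4-voice-tokenizer)": "models/audio_encoder",
    "Audio decoder (glm-4-voice-decoder)": "models/audio_decoder",
}


def missing_weights(root: Path,
                    expected: Dict[str, str] = EXPECTED_WEIGHTS) -> List[str]:
    """Return the components whose directory is absent or empty under root."""
    missing = []
    for name, rel in expected.items():
        path = root / rel
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(name)
    return missing


if __name__ == "__main__":
    for name in missing_weights(Path(".")):
        print(f"missing: {name}")
```

The check only confirms that each directory exists and is non-empty; it does not validate the files inside.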

Highlighted Details

Generates the first audio token chunk in 53 ms, down from 236 ms.
Offers 3-5x inference speedup at the 7B parameter scale.
Trained on 200k hours of open-source audio data.
Demonstrates competitive performance on ASR, TTS, and SQA benchmarks.
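As a quick check on the latency figures above, the arithmetic can be worked out directly. The two input numbers come from the text; the variable names are illustrative.

```python
baseline_ms = 236.0  # first-chunk latency baseline quoted in the text
reported_ms = 53.0   # first-chunk latency reported for VITA-Audio

ratio = baseline_ms / reported_ms
reduction = (baseline_ms - reported_ms) / baseline_ms

print(f"{ratio:.2f}x lower first-chunk latency")  # 4.45x
print(f"{reduction:.1%} reduction")               # 77.5%
```

This ratio covers first-chunk latency only. The 3-5x inference speedup is listed as a separate item, so the two figures should not be assumed to describe the same measurement.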

Maintenance & Community

The project is actively being developed, with plans to release cleaned data.
Model weights are available on Hugging Face (e.g., VITA-Audio-Boost, VITA-Audio-Balance).
Contact information includes a WeChat group.

Licensing & Compatibility

The README does not explicitly state a license.
Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project is still under active development, and some components, such as the cleaned open-source data, have yet to be released. VITA-Audio's output is non-deterministic, and the developers disclaim responsibility for any issues arising from its use.

Explore Similar Projects

UniAudio2 by yangdongchao

pheme by PolyAI-LDN

Meta-voicebox by SpeechifyInc

SpeechGPT-2.0-preview by OpenMOSS

csm-mlx by senstella

UniAudio by yangdongchao

edgedict by theblackcat102

dia2 by nari-labs

fast-voice-assistant by dsa

Kimi-Audio by MoonshotAI

higgs-audio by boson-ai

KittenTTS by KittenML