Speech model for fast audio-text token generation
VITA-Audio is an end-to-end large speech model designed for efficient and fast audio-text token generation. It targets researchers and developers working with speech processing, offering significant inference speedups and low latency for tasks like Automatic Speech Recognition (ASR), Text-to-Speech (TTS), and Spoken Question Answering (SQA).
How It Works
VITA-Audio employs interleaved cross-modal token generation, which lets the model produce audio during the first forward pass instead of waiting for the full text response. By using a set of prefill tokens, it sharply reduces the latency of the first audio chunk and achieves a 3-5x inference speedup on a 7B parameter model. The model is trained on open-source data comprising 200k hours of audio.
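The effect of interleaving on first-audio latency can be illustrated with a toy scheduler. This is only a sketch of the general idea, not VITA-Audio's actual decoding loop: the function names, the (modality, index) token representation, and the chunk sizes below are all hypothetical and chosen for illustration.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

# Hypothetical token representation: (modality, index). Not the project's format.
Token = Tuple[str, int]


def interleave(
    text_tokens: Iterable[Token],
    audio_tokens: Iterable[Token],
    text_chunk: int = 1,
    audio_chunk: int = 2,
) -> Iterator[Token]:
    """Emit text_chunk text tokens, then audio_chunk audio tokens, repeatedly.

    Chunk sizes are illustrative assumptions, not values from the project.
    """
    text_iter, audio_iter = iter(text_tokens), iter(audio_tokens)
    while True:
        text_part = list(islice(text_iter, text_chunk))
        audio_part = list(islice(audio_iter, audio_chunk))
        if not text_part and not audio_part:
            return  # both streams exhausted
        yield from text_part
        yield from audio_part


def first_audio_position(stream: Iterable[Token]) -> int:
    """Return the index of the first audio token in the stream, or -1 if none."""
    for position, (modality, _) in enumerate(stream):
        if modality == "audio":
            return position
    return -1


text: List[Token] = [("text", i) for i in range(4)]
audio: List[Token] = [("audio", i) for i in range(8)]

# Sequential baseline: all text first, then all audio.
sequential = text + audio
# Interleaved schedule: audio starts right after the first text token.
interleaved = list(interleave(text, audio, text_chunk=1, audio_chunk=2))

print("first audio token, sequential: ", first_audio_position(sequential))
print("first audio token, interleaved:", first_audio_position(interleaved))
```

In this toy setup the first audio token moves from position 4 to position 1, which is the kind of first-chunk latency reduction the interleaved design aims for. The real speedup depends on the model and its chunking scheme.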
Quick Start & Requirements
Install with pip install -r requirements_ds_gpu.txt and pip install -e .
A Docker image, shenyunhang/pytorch:24.11-py3_2024-1224, is available.
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The project is still under active development, and some components, such as the cleaned open-source data, have yet to be released. VITA-Audio's output is non-deterministic, and the developers disclaim responsibility for any issues arising from its use.