Audio/speech LLM for perception and understanding, supporting open-ended questions
This repository provides the official PyTorch implementation, datasets, and pretrained checkpoints for LTU and LTU-AS, the first- and second-generation large language models for audio and speech understanding. These models achieve state-of-the-art results on closed-ended audio tasks and can answer open-ended questions about audio content; they target researchers and developers in audio and speech AI.
How It Works
LTU and LTU-AS integrate audio and speech perception with the natural-language understanding of a large language model. LTU-AS, the second generation, incorporates Whisper features, enabling it to process both speech and general audio within a single model. This yields a unified framework for diverse audio tasks, from closed-ended classification to open-ended question answering.
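As a rough illustration of this recipe, the sketch below extracts Whisper encoder features and projects them into an LLM's embedding space. This is a toy under stated assumptions, not the repository's code: the whisper-base checkpoint, the untrained projection layer, and the 4096-dimensional target (LLaMA-7B's hidden size) are all illustrative, and LTU-AS's actual modules on top of Whisper differ.

```python
# Toy sketch (not the repository's code): Whisper encoder features are
# projected into an LLM embedding space so they can be prepended to the
# embedded text prompt.
import numpy as np
import torch
import torch.nn as nn
from transformers import WhisperFeatureExtractor, WhisperModel

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")
whisper = WhisperModel.from_pretrained("openai/whisper-base").eval()

# Assumed projection from Whisper's hidden size to a 4096-dim LLM
# embedding space (LLaMA-7B-sized); trained jointly in the real system,
# random here.
project = nn.Linear(whisper.config.d_model, 4096)

def encode_audio(waveform_16khz: np.ndarray) -> torch.Tensor:
    # Raw audio -> log-mel features -> Whisper encoder states (1, 1500, d_model)
    feats = feature_extractor(waveform_16khz, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        states = whisper.encoder(feats.input_features).last_hidden_state
    return project(states)  # (1, 1500, 4096)

audio_embeds = encode_audio(np.zeros(16000 * 10, dtype=np.float32))  # 10 s of silence
# In the full model, these embeddings are concatenated with the embedded
# text prompt and fed to the LLM, e.g.:
# inputs_embeds = torch.cat([audio_embeds, llm.get_input_embeddings()(ids)], dim=1)
```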
Quick Start & Requirements
Setup involves installing the dependencies listed in requirements.txt, and installing the customized Hugging Face transformers and peft packages from the provided hf-dev and peft-main directories. Run ./inference.sh for inference. Use prep_train.sh for data/model preparation, and finetune_toy.sh or finetune_toy_low_resource.sh for training.
Highlighted Details
Maintenance & Community
The project is led by Yuan Gong and is affiliated with MIT and the MIT-IBM Watson AI Lab. Questions and bug reports can be raised as GitHub issues, to which the authors respond promptly.
Licensing & Compatibility
The repository does not explicitly state a license. The datasets are provided for research purposes, with audio files sourced from public datasets.
Limitations & Caveats
LTU-AS performance can be affected by differences in Whisper feature generation across GPUs; specific checkpoints are provided for older and newer GPU architectures. The raw audio files for the datasets are not provided due to copyright, requiring users to download them separately.
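If the cross-GPU differences come from TF32 math defaults on Ampere-and-newer GPUs, which is an assumption on our part rather than something the repository states, one mitigation when extracting Whisper features is to pin PyTorch's float32 behavior:

```python
# Hedged sketch: force strict fp32 math so Whisper feature extraction is
# less dependent on GPU generation. That TF32 is the cause of the
# discrepancy is an assumption, not confirmed by the repository.
import torch

torch.backends.cuda.matmul.allow_tf32 = False  # no TF32 matmuls (Ampere+)
torch.backends.cudnn.allow_tf32 = False        # no TF32 cuDNN convolutions
torch.set_float32_matmul_precision("highest")  # keep full fp32 matmul precision
```

Regardless, the simplest safe route is the one the authors provide: use the checkpoint that matches your GPU generation.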