YuanGongND: Audio/speech LLM for perception and understanding, supporting open-ended questions
Top 65.9% on SourcePulse
This repository provides the official PyTorch implementation, datasets, and pretrained checkpoints for LTU and LTU-AS, large language models for audio and speech understanding. These models achieve state-of-the-art results on closed-ended audio tasks and can answer open-ended questions about audio content, targeting researchers and developers in audio and speech AI.
How It Works
LTU and LTU-AS are built upon a large language model architecture, integrating audio and speech perception with natural language understanding. LTU-AS, the second generation, specifically incorporates Whisper features, enabling it to process both speech and general audio. This approach allows for a unified framework to tackle diverse audio-related tasks, from classification to open-ended question answering.
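The unified framework described above can be sketched in miniature: a Whisper-style encoder produces a sequence of audio features, which are temporally pooled, projected into the LLM's embedding space, and prepended to the embedded text question. All names, dimensions, and the pooling factor below are illustrative assumptions, not the repository's actual API:

```python
import numpy as np

# Illustrative sketch of an LTU-AS-style input pipeline (shapes and the
# mean-pooling/linear-projection choices are assumptions for exposition).
rng = np.random.default_rng(0)

T, D_AUDIO = 500, 1280   # stand-in for a Whisper encoder output: T frames x D dims
D_LLM = 4096             # stand-in for the LLM hidden size
POOL = 20                # temporal pooling factor (illustrative)

audio_feats = rng.standard_normal((T, D_AUDIO))  # stand-in for Whisper features
# Downsample in time by averaging every POOL consecutive frames.
pooled = audio_feats.reshape(T // POOL, POOL, D_AUDIO).mean(axis=1)  # (25, 1280)
# Stand-in for a learned projection into the LLM embedding space.
proj = rng.standard_normal((D_AUDIO, D_LLM)) * 0.02
audio_tokens = pooled @ proj                      # (25, 4096) soft audio tokens

text_tokens = rng.standard_normal((12, D_LLM))    # stand-in for the embedded question
# Prepend audio tokens to the text tokens to form the LLM input sequence.
llm_input = np.concatenate([audio_tokens, text_tokens])
print(llm_input.shape)  # (37, 4096)
```

The key design point this illustrates is that the audio becomes ordinary prefix tokens, so classification and open-ended question answering share one interface.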
Quick Start & Requirements
Setup involves installing dependencies from requirements.txt, along with the customized Hugging Face transformers and peft packages from the provided hf-dev and peft-main directories. Inference is run via ./inference.sh. For training, use prep_train.sh for data/model preparation and finetune_toy.sh or finetune_toy_low_resource.sh.
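A minimal setup sketch of these steps (the repository URL and the exact layout of the hf-dev and peft-main directories are assumptions; the script names are those given above):

```shell
# Clone the repository (URL assumed from the project name).
git clone https://github.com/YuanGongND/ltu.git
cd ltu

# Install base dependencies.
pip install -r requirements.txt

# Install the customized Hugging Face packages shipped with the repo
# (editable installs; exact paths may differ in the actual layout).
pip install -e hf-dev
pip install -e peft-main

# Run inference with the provided script.
./inference.sh

# Training: prepare data/models, then run a toy fine-tune.
./prep_train.sh
./finetune_toy.sh   # or ./finetune_toy_low_resource.sh on limited hardware
```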
Maintenance & Community
The project is led by Yuan Gong and associated with MIT and the MIT-IBM Watson AI Lab. Issues raised on GitHub typically receive prompt responses.
Licensing & Compatibility
The repository does not explicitly state a license. The datasets are provided for research purposes, with audio files sourced from public datasets.
Limitations & Caveats
LTU-AS performance can be affected by differences in Whisper feature generation across GPUs; specific checkpoints are provided for older and newer GPU architectures. The raw audio files for the datasets are not provided due to copyright, requiring users to download them separately.