Audio/speech LLM for perception and understanding, supporting open-ended questions
This repository provides the official PyTorch implementation, datasets, and pretrained checkpoints for LTU and LTU-AS, the first- and second-generation large language models for audio and speech understanding. These models achieve state-of-the-art results on closed-ended audio tasks and can answer open-ended questions about audio content; they target researchers and developers in audio and speech AI.
How It Works
LTU and LTU-AS integrate audio and speech perception with the natural-language understanding of a large language model. LTU-AS, the second generation, incorporates Whisper features, enabling it to process both speech and general audio within a single model. This yields a unified framework for diverse audio tasks, from closed-ended classification to open-ended question answering.
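As a rough illustration of this recipe, the sketch below extracts Whisper encoder features and projects them into an LLM's embedding space. This is a toy under stated assumptions, not the repository's code: the whisper-base checkpoint, the untrained projection layer, and the 4096-dimensional target (LLaMA-7B's hidden size) are all illustrative, and LTU-AS's actual modules on top of Whisper differ.

```python
# Toy sketch (not the repository's code): Whisper encoder features are
# projected into an LLM embedding space so they can be prepended to the
# embedded text prompt.
import numpy as np
import torch
import torch.nn as nn
from transformers import WhisperFeatureExtractor, WhisperModel

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")
whisper = WhisperModel.from_pretrained("openai/whisper-base").eval()

# Assumed projection from Whisper's hidden size to a 4096-dim LLM
# embedding space (LLaMA-7B-sized); trained jointly in the real system,
# random here.
project = nn.Linear(whisper.config.d_model, 4096)

def encode_audio(waveform_16khz: np.ndarray) -> torch.Tensor:
    # Raw audio -> log-mel features -> Whisper encoder states (1, 1500, d_model)
    feats = feature_extractor(waveform_16khz, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        states = whisper.encoder(feats.input_features).last_hidden_state
    return project(states)  # (1, 1500, 4096)

audio_embeds = encode_audio(np.zeros(16000 * 10, dtype=np.float32))  # 10 s of silence
# In the full model, these embeddings are concatenated with the embedded
# text prompt and fed to the LLM, e.g.:
# inputs_embeds = torch.cat([audio_embeds, llm.get_input_embeddings()(ids)], dim=1)
```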
Quick Start & Requirements
Setup involves installing the dependencies listed in requirements.txt, and installing the customized Hugging Face transformers and peft packages from the provided hf-dev and peft-main directories. Run ./inference.sh for inference. Use prep_train.sh for data/model preparation, and finetune_toy.sh or finetune_toy_low_resource.sh for training.
Highlighted Details
Maintenance & Community
The project is led by Yuan Gong and is affiliated with MIT and the MIT-IBM Watson AI Lab. Questions and bug reports can be raised as GitHub issues, to which the authors respond promptly.
Licensing & Compatibility
The repository does not explicitly state a license. The datasets are provided for research purposes, with audio files sourced from public datasets.
Limitations & Caveats
LTU-AS performance can be affected by differences in Whisper feature generation across GPUs; specific checkpoints are provided for older and newer GPU architectures. The raw audio files for the datasets are not provided due to copyright, requiring users to download them separately.
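If the cross-GPU differences come from TF32 math defaults on Ampere-and-newer GPUs, which is an assumption on our part rather than something the repository states, one mitigation when extracting Whisper features is to pin PyTorch's float32 behavior:

```python
# Hedged sketch: force strict fp32 math so Whisper feature extraction is
# less dependent on GPU generation. That TF32 is the cause of the
# discrepancy is an assumption, not confirmed by the repository.
import torch

torch.backends.cuda.matmul.allow_tf32 = False  # no TF32 matmuls (Ampere+)
torch.backends.cudnn.allow_tf32 = False        # no TF32 cuDNN convolutions
torch.set_float32_matmul_precision("highest")  # keep full fp32 matmul precision
```

Regardless, the simplest safe route is the one the authors provide: use the checkpoint that matches your GPU generation.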