ltu by YuanGongND

Audio/speech LLM for perception and understanding, supporting open-ended questions

created 2 years ago
447 stars

Top 68.2% on sourcepulse

Project Summary

This repository provides the official PyTorch implementation, datasets, and pretrained checkpoints for LTU and LTU-AS, the first- and second-generation audio and speech large language models from the authors. The models achieve state-of-the-art results on closed-ended audio tasks and can answer open-ended questions about audio content, targeting researchers and developers in audio and speech AI.

How It Works

LTU and LTU-AS integrate audio and speech perception into a large language model (LLaMA), coupling an audio encoder's output with the LLM's natural language understanding. LTU-AS, the second generation, additionally incorporates Whisper features, enabling it to process both speech and general, non-speech audio. This unified framework tackles diverse audio tasks, from classification to open-ended question answering.
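
The sketch below illustrates this wiring in plain PyTorch: a projection layer maps audio-encoder features into the LLM's embedding space, and the LLM then attends over audio and text embeddings together. The class name and dimensions here are illustrative assumptions, not the repository's actual code.

    import torch
    import torch.nn as nn

    class AudioToLLMProjector(nn.Module):
        """Maps audio-encoder features into the LLM's embedding space
        (hypothetical helper; dimensions are illustrative)."""
        def __init__(self, audio_dim=1280, llm_dim=4096):
            super().__init__()
            self.proj = nn.Linear(audio_dim, llm_dim)

        def forward(self, audio_feats):
            # audio_feats: (batch, time, audio_dim), e.g. Whisper encoder frames
            return self.proj(audio_feats)

    # Toy tensors standing in for real audio features and question tokens.
    whisper_feats = torch.randn(1, 25, 1280)   # pooled audio/speech features
    text_embeds = torch.randn(1, 16, 4096)     # embedded question tokens

    audio_embeds = AudioToLLMProjector()(whisper_feats)

    # The LLM (LLaMA with low-rank adapters in LTU/LTU-AS) attends over the
    # concatenated sequence to produce an open-ended answer.
    llm_input = torch.cat([audio_embeds, text_embeds], dim=1)
    print(llm_input.shape)  # torch.Size([1, 41, 4096])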

Quick Start & Requirements

  • Installation: Requires setting up separate Conda environments for LTU and LTU-AS, installing dependencies via requirements.txt, and installing customized Hugging Face transformers and peft packages from the provided hf-dev and peft-main directories.
  • Prerequisites: Python 3.10 and PyTorch. LTU-AS additionally requires Whisper feature extraction (see the feature-extraction sketch after this list).
  • Inference: Can be performed via Hugging Face Spaces, the Gradio API (see the client sketch after this list), or locally by activating the respective Conda environment and running the provided shell scripts (./inference.sh).
  • Finetuning: Requires prep_train.sh for data/model preparation and finetune_toy.sh or finetune_toy_low_resource.sh for training.
  • Resources: Training requires significant GPU VRAM (e.g., 4x A6000). Inference is possible on CPU; GPU inference requires substantial VRAM (2x TitanX for LTU, 4x TitanX for LTU-AS).
  • Links: LTU Interactive Demo, LTU-AS Interactive Demo
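
For the Whisper prerequisite above, a minimal feature-extraction sketch with the openai-whisper package might look like the following; the model size, device handling, and usage are assumptions, since the repository ships its own extraction scripts.

    import torch
    import whisper  # pip install openai-whisper

    # Load on CPU in fp32; on GPU, Whisper defaults to fp16, which is one
    # reason encoder features can differ across GPU generations (see
    # Limitations & Caveats below).
    model = whisper.load_model("large-v1", device="cpu")

    audio = whisper.load_audio("sample.wav")
    audio = whisper.pad_or_trim(audio)                     # fixed 30 s window
    mel = whisper.log_mel_spectrogram(audio).unsqueeze(0)  # (1, 80, 3000)

    with torch.no_grad():
        feats = model.encoder(mel)                         # (1, 1500, 1280)
    print(feats.shape)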
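
For the Gradio API route, a hedged client-side sketch follows; the Space name, argument order, and api_name are assumptions, so check the repository's Gradio API instructions for the exact endpoint.

    from gradio_client import Client  # pip install gradio_client

    # Space name and endpoint are assumptions, not confirmed by the repo text.
    client = Client("yuangongfdu/ltu")
    answer = client.predict(
        "sample.wav",                            # audio file to analyze
        "What can be inferred from the audio?",  # open-ended question
        api_name="/predict",
    )
    print(answer)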

Highlighted Details

  • Supports both closed-ended and open-ended question answering on audio.
  • LTU-AS leverages Whisper features for enhanced speech and audio understanding.
  • Provides comprehensive datasets (OpenAQA, OpenASQA) and multiple pretrained checkpoints, including LLaMA 13B variants.
  • Detailed instructions for fine-tuning and reproducing training stages are included.

Maintenance & Community

The project is led by Yuan Gong and affiliated with MIT and the MIT-IBM Watson AI Lab. Issues raised on GitHub typically receive prompt responses.

Licensing & Compatibility

The repository does not explicitly state a license. The datasets are provided for research purposes, with audio files sourced from public datasets.

Limitations & Caveats

LTU-AS performance can be affected by differences in Whisper feature generation across GPU generations; separate checkpoints are provided for older and newer GPU architectures. The raw audio files for the datasets are not distributed due to copyright, so users must download them from the original sources themselves.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 17 stars in the last 90 days

Explore Similar Projects

Starred by Tim J. Baek (Founder of Open WebUI), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 3 more.

StyleTTS2 by yl4579

Top 0.2% on sourcepulse
6k stars
Text-to-speech model achieving human-level synthesis
created 2 years ago
updated 11 months ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems) and Jeff Hammerbacher (Cofounder of Cloudera).

AudioGPT by AIGC-Audio

Top 0.1% on sourcepulse
10k stars
Audio processing and generation research project
created 2 years ago
updated 1 year ago