Implementation of an ISMIR 2023 research paper on music captioning with LLMs
LP-MusicCaps provides a framework for generating descriptive captions for music, targeting researchers and developers in music information retrieval and AI-driven content creation. It offers two primary methods: generating captions from text tags using LLMs and training end-to-end models for audio-to-caption generation, aiming for human-level captioning quality.
How It Works
The project employs a two-stage approach. First, "Tag-to-Caption" uses OpenAI's GPT-3.5 Turbo API to expand user-provided music tags into detailed captions, producing rich textual descriptions from metadata alone. Second, "Audio-to-Caption" trains a cross-modal encoder-decoder architecture: the LLM first turns the tags attached to each audio clip into "pseudo captions", and the model is then fine-tuned via transfer learning on the resulting audio-pseudo-caption pairs, enabling end-to-end music captioning directly from audio input.
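As a rough sketch of the first stage, the snippet below expands a tag list into a caption with the official openai Python client (v1 style). The prompt wording and the tags_to_caption helper are illustrative assumptions; the project's actual prompt templates (writing, summary, paraphrase, attribute_prediction) live under lpmc/llm_captioning.

```python
# Minimal sketch of the "Tag-to-Caption" idea, assuming the official
# openai client (>=1.0). The prompt text here is a placeholder, not the
# repo's actual template.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def tags_to_caption(tags: list[str]) -> str:
    """Ask GPT-3.5 Turbo to expand a list of music tags into a caption."""
    prompt = (
        "Write a single descriptive sentence about a piece of music "
        f"with the following tags: {', '.join(tags)}."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


print(tags_to_caption(["orchestra", "cinematic", "uplifting", "strings"]))
```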
Quick Start & Requirements
Install: pip install -e .
Tag-to-caption (requires an OpenAI API key): cd lpmc/llm_captioning && python run.py --prompt {writing,summary,paraphrase,attribute_prediction} --tags <music_tags>
Demo: cd demo && python app.py
Audio-to-caption: cd lpmc/music_captioning && wget https://huggingface.co/seungheondoh/lp-music-caps/resolve/main/transfer.pth -O exp/transfer/lp_music_caps/last.pth && python captioning.py --audio_path ../../dataset/samples/orchestra.wav
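To caption more than one file, a small wrapper can loop over a directory and shell out to the repo's captioning.py; this is a sketch that assumes only the --audio_path flag shown above, run from lpmc/music_captioning after downloading the checkpoint, and the directory path is a placeholder.

```python
# Hypothetical batch wrapper around the repo's captioning.py entry point.
import subprocess
from pathlib import Path

AUDIO_DIR = Path("../../dataset/samples")  # placeholder: point at your audio folder

for wav in sorted(AUDIO_DIR.glob("*.wav")):
    # captioning.py prints the generated caption for each input file
    subprocess.run(
        ["python", "captioning.py", "--audio_path", str(wav)],
        check=True,
    )
```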
Highlighted Details
Pretrained weights (e.g., transfer.pth) are distributed via Hugging Face at seungheondoh/lp-music-caps, and four LLM prompt styles (writing, summary, paraphrase, attribute_prediction) are supported for tag-to-caption generation.
Maintenance & Community
The project accompanies a paper presented at ISMIR 2023. The README does not document community channels or a roadmap.
Licensing & Compatibility
Released under CC-BY-NC 4.0, which permits research and personal use but prohibits commercial use.
Limitations & Caveats
Beyond the non-commercial license, the "Tag-to-Caption" method depends on the availability and per-request cost of the OpenAI GPT-3.5 Turbo API.