Implementation of an ISMIR 2023 research paper on music captioning with LLMs
LP-MusicCaps provides a framework for generating descriptive captions for music, targeting researchers and developers in music information retrieval and AI-driven content creation. It offers two primary methods: generating captions from text tags using LLMs and training end-to-end models for audio-to-caption generation, aiming for human-level captioning quality.
How It Works
The project employs a two-stage approach. First, "Tag-to-Caption" uses OpenAI's GPT-3.5 Turbo API to expand user-provided music tags into detailed captions, producing rich textual descriptions from metadata alone. Second, "Audio-to-Caption" trains a cross-modal encoder-decoder architecture: the LLM first turns the tags attached to each audio clip into "pseudo captions", and the model is then fine-tuned via transfer learning on the resulting audio-pseudo-caption pairs, enabling end-to-end music captioning directly from audio input.
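As a rough sketch of the first stage, the snippet below expands a tag list into a caption with the official openai Python client (v1 style). The prompt wording and the tags_to_caption helper are illustrative assumptions; the project's actual prompt templates (writing, summary, paraphrase, attribute_prediction) live under lpmc/llm_captioning.

```python
# Minimal sketch of the "Tag-to-Caption" idea, assuming the official
# openai client (>=1.0). The prompt text here is a placeholder, not the
# repo's actual template.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def tags_to_caption(tags: list[str]) -> str:
    """Ask GPT-3.5 Turbo to expand a list of music tags into a caption."""
    prompt = (
        "Write a single descriptive sentence about a piece of music "
        f"with the following tags: {', '.join(tags)}."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


print(tags_to_caption(["orchestra", "cinematic", "uplifting", "strings"]))
```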
Quick Start & Requirements
Install: pip install -e .
Tag-to-caption (requires an OpenAI API key): cd lpmc/llm_captioning && python run.py --prompt {writing,summary,paraphrase,attribute_prediction} --tags <music_tags>
Demo: cd demo && python app.py
Audio-to-caption: cd lpmc/music_captioning && wget https://huggingface.co/seungheondoh/lp-music-caps/resolve/main/transfer.pth -O exp/transfer/lp_music_caps/last.pth && python captioning.py --audio_path ../../dataset/samples/orchestra.wav
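To caption more than one file, a small wrapper can loop over a directory and shell out to the repo's captioning.py; this is a sketch that assumes only the --audio_path flag shown above, run from lpmc/music_captioning after downloading the checkpoint, and the directory path is a placeholder.

```python
# Hypothetical batch wrapper around the repo's captioning.py entry point.
import subprocess
from pathlib import Path

AUDIO_DIR = Path("../../dataset/samples")  # placeholder: point at your audio folder

for wav in sorted(AUDIO_DIR.glob("*.wav")):
    # captioning.py prints the generated caption for each input file
    subprocess.run(
        ["python", "captioning.py", "--audio_path", str(wav)],
        check=True,
    )
```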
Highlighted Details
Pretrained weights (e.g., transfer.pth) are distributed via Hugging Face at seungheondoh/lp-music-caps, and four LLM prompt styles (writing, summary, paraphrase, attribute_prediction) are supported for tag-to-caption generation.
Maintenance & Community
The project accompanies a paper presented at ISMIR 2023. The README does not document community channels or a roadmap.
Licensing & Compatibility
Released under CC-BY-NC 4.0, which permits research and personal use but prohibits commercial use.
Limitations & Caveats
Beyond the non-commercial license, the "Tag-to-Caption" method depends on the availability and per-request cost of the OpenAI GPT-3.5 Turbo API.