Text-to-music generation research paper using multimodal LLMs
Mustango is an open-source project enabling controllable text-to-music generation. It targets researchers and developers interested in AI-driven music creation, offering a Latent Diffusion Model (LDM) approach for generating music from detailed textual descriptions.
How It Works
Mustango conditions a Latent Diffusion Model (LDM) on text embeddings from the Flan-T5 language model and on explicit musical features. This architecture allows fine-grained control over the generated music through textual prompts, aiming for higher fidelity and musical coherence than previous methods. Because an LDM runs the diffusion process in a compressed latent space rather than on the full-resolution representation, it can generate high-resolution outputs more efficiently.
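To make the efficiency argument concrete, here is a minimal sketch of the generic DDPM forward (noising) process applied to a latent tensor, written with NumPy. It is not Mustango's code: the linear noise schedule, the step count T = 1000, and both tensor shapes are assumptions chosen only for illustration. The sketch samples z_t from q(z_t | z_0) in closed form and reports how much smaller the hypothetical latent is than the hypothetical spectrogram it stands in for.

```python
import numpy as np


def linear_beta_schedule(num_steps: int, beta_start: float = 1e-4, beta_end: float = 0.02) -> np.ndarray:
    """Linearly spaced noise variances beta_1..beta_T (a common DDPM default, assumed here)."""
    return np.linspace(beta_start, beta_end, num_steps)


def q_sample(z0: np.ndarray, t: int, alpha_bar: np.ndarray, rng: np.random.Generator):
    """Sample z_t ~ N(sqrt(abar_t) * z_0, (1 - abar_t) * I) in closed form; also return the noise."""
    eps = rng.standard_normal(z0.shape)
    zt = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return zt, eps


T = 1000  # assumed number of diffusion steps
betas = linear_beta_schedule(T)
alpha_bar = np.cumprod(1.0 - betas)  # abar_t = prod_{s<=t} (1 - beta_s)

rng = np.random.default_rng(0)

# Hypothetical shapes: an autoencoder is assumed to compress a mel-spectrogram
# of shape (channels, frames, mel bins) into a smaller latent grid.
mel_shape = (1, 1024, 64)
latent_shape = (4, 256, 16)
z0 = rng.standard_normal(latent_shape)  # stand-in for an encoded clip

zt, eps = q_sample(z0, t=500, alpha_bar=alpha_bar, rng=rng)

# Each denoising step processes a tensor of the latent's size, not the mel's size.
compression = np.prod(mel_shape) / np.prod(latent_shape)
print(f"latent is {compression:.1f}x smaller than the mel input")
```

Every denoising step operates on tensors of the latent's size, so a smaller latent directly lowers per-step compute; the text and musical-feature conditioning would enter through the denoising network, which this sketch omits.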
Quick Start & Requirements
Clone the repository, cd into it, and run pip install -r requirements.txt. Then navigate to the diffusers directory and run pip install -e . to install it in editable mode. Dependencies include soundfile, IPython, Hugging Face diffusers, and accelerate for training.
Highlighted Details
… accelerate package.
Maintenance & Community
The project is associated with AMAAI-Lab. Further community engagement details are not explicitly provided in the README.
Licensing & Compatibility
The repository does not explicitly state a license. Because it builds on Hugging Face diffusers and pretrained models, their respective licenses also apply; these are generally permissive for research and commercial use, but users should verify the terms of each component.
Limitations & Caveats
The README recommends training from scratch on the MusicBench dataset for at least 40 epochs, which implies a substantial compute budget for retraining. The project is presented as research code, and its stability for production use is not guaranteed.
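To gauge what a 40-epoch run means in practice, the sketch below converts epochs into optimizer updates. The dataset size, batch size, device count, and gradient-accumulation factor in the example are hypothetical, and the helper name is illustrative. It assumes the common convention of counting batches per epoch first and then dividing by the accumulation factor, rounding up at each stage.

```python
import math


def total_optimizer_steps(num_examples: int, per_device_batch: int, num_devices: int,
                          grad_accum_steps: int, epochs: int) -> int:
    """Hypothetical helper: optimizer updates for a full training run.

    A partial final batch still counts as a batch, and a partial final
    accumulation window still triggers an update (both rounded up).
    """
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_devices))
    updates_per_epoch = math.ceil(batches_per_epoch / grad_accum_steps)
    return updates_per_epoch * epochs


# Hypothetical run: 50,000 clips, batch 8 on each of 4 GPUs, accumulate 2 batches, 40 epochs.
steps = total_optimizer_steps(50_000, per_device_batch=8, num_devices=4,
                              grad_accum_steps=2, epochs=40)
print(steps)
```

Multiplying the result by a measured time per update on your own hardware gives a rough wall-clock estimate before committing to a full retraining run.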