Audio creation pipeline using LLMs for compositional generation
Top 59.7% on sourcepulse
WavJourney enables compositional audio creation from text prompts, targeting multimedia storytellers and content creators. It generates integrated audio experiences featuring custom speakers, contextual speech, music, and sound effects, aiming to enhance auditory storytelling.
How It Works
WavJourney leverages Large Language Models (LLMs) to orchestrate a pipeline of specialized audio generation models. It breaks down a text prompt into a structured script, assigning roles and emotional cues. This script then drives separate Text-to-Speech (TTS), music generation, and sound effect models, composing them into a coherent audio narrative. This compositional approach allows for fine-grained control and contextually relevant audio elements.
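The compositional flow described above can be sketched in miniature. Note this is an illustrative assumption of the approach, not WavJourney's actual API: the function names, script fields, and string-placeholder "renderers" are all hypothetical stand-ins for the LLM planner and the real TTS, music, and sound effect models.

```python
# Hypothetical sketch of LLM-driven compositional audio generation.
# All names and the script schema are illustrative, not WavJourney's API.

def plan_script(prompt):
    """Stand-in for the LLM step: turn a text prompt into a structured script
    with roles, emotional cues, and non-speech elements."""
    return [
        {"type": "speech", "speaker": "narrator", "text": prompt, "emotion": "calm"},
        {"type": "music", "description": "soft piano underscore", "duration": 8.0},
        {"type": "sfx", "description": "distant thunder", "duration": 2.5},
    ]

def render(item):
    """Stand-in for dispatching each script item to a specialized model
    (TTS, music generation, or sound effects)."""
    renderers = {
        "speech": lambda i: f"<tts:{i['speaker']}:{i['text']}>",
        "music": lambda i: f"<music:{i['description']}>",
        "sfx": lambda i: f"<sfx:{i['description']}>",
    }
    return renderers[item["type"]](item)

def compose(prompt):
    """Render every item and compose the segments in order; a real pipeline
    would mix the resulting audio into one track."""
    return [render(item) for item in plan_script(prompt)]

print(compose("A storm rolls in over the harbor"))
```

The key design point is the separation of planning from rendering: the LLM only produces the structured script, so each specialized model can be swapped or upgraded independently.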
Quick Start & Requirements
Set up the environment with bash ./scripts/EnvsSetup.sh and activate it with conda activate WavJourney. An OpenAI API key (set via the WAVJOURNEY_OPENAI_KEY environment variable) is necessary for GPT-4 access. Download the pretrained models with python scripts/download_models.py. Start the audio generation services with bash scripts/start_services.sh, and the UI with bash scripts/start_ui.sh. Generate audio from the command line with python wavjourney_cli.py -f --input-text "...".
Maintenance & Community
The project is actively seeking research and commercial cooperation. Community interaction is encouraged via Discord and HuggingFace.
Licensing & Compatibility
The repository's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The default configuration requires a Linux operating system and a GPU with over 16GB of VRAM. The project relies on external API services, including GPT-4, which may incur costs and introduce external dependencies.