WavJourney by Audio-AGI

Audio creation pipeline using LLMs for compositional generation

created 2 years ago
539 stars

Top 59.7% on sourcepulse

Project Summary

WavJourney enables compositional audio creation from text prompts, targeting multimedia storytellers and content creators. It generates integrated audio experiences featuring custom speakers, contextual speech, music, and sound effects, aiming to enhance auditory storytelling.

How It Works

WavJourney leverages Large Language Models (LLMs) to orchestrate a pipeline of specialized audio generation models. It breaks down a text prompt into a structured script, assigning roles and emotional cues. This script then drives separate Text-to-Speech (TTS), music generation, and sound effect models, composing them into a coherent audio narrative. This compositional approach allows for fine-grained control and contextually relevant audio elements.
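
The README does not reproduce the script schema here, so the sketch below is only an illustration of the compositional idea: it assumes a made-up script format of typed entries that are dispatched to Bark (TTS) and AudioCraft's MusicGen/AudioGen models, backends the project integrates. It is not WavJourney's actual pipeline code.

```python
# Illustrative sketch only: an LLM-produced "script" modeled as a list of typed
# entries, each dispatched to a dedicated generator. The schema is a hypothetical
# stand-in for the structured script WavJourney derives from the text prompt.
from bark import SAMPLE_RATE, generate_audio          # TTS backend used by WavJourney
from audiocraft.models import MusicGen, AudioGen      # music / sound-effect backends

script = [
    {"type": "speech", "text": "Welcome to the midnight forest.",
     "voice": "v2/en_speaker_6"},                     # Bark voice preset
    {"type": "sfx",   "text": "owls hooting in the distance", "duration": 4},
    {"type": "music", "text": "gentle ambient pads, mysterious mood", "duration": 8},
]

music_model = MusicGen.get_pretrained("facebook/musicgen-small")
sfx_model = AudioGen.get_pretrained("facebook/audiogen-medium")

segments = []  # (waveform, sample_rate) pairs, in script order
for entry in script:
    if entry["type"] == "speech":
        wav = generate_audio(entry["text"], history_prompt=entry["voice"])
        segments.append((wav, SAMPLE_RATE))
    elif entry["type"] == "music":
        music_model.set_generation_params(duration=entry["duration"])
        wav = music_model.generate([entry["text"]])[0].cpu().numpy().squeeze()
        segments.append((wav, music_model.sample_rate))
    elif entry["type"] == "sfx":
        sfx_model.set_generation_params(duration=entry["duration"])
        wav = sfx_model.generate([entry["text"]])[0].cpu().numpy().squeeze()
        segments.append((wav, sfx_model.sample_rate))

# A real composer would resample to a common rate, mix overlapping tracks, and
# apply the timing and volume cues from the script; segments stay separate here.
```

One model call per script entry is what makes the fine-grained control possible: a speech line, a music cue, or a sound effect can be regenerated independently without redoing the whole piece.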

Quick Start & Requirements

  • Install via bash ./scripts/EnvsSetup.sh and activate with conda activate WavJourney.
  • Requires a Linux OS and a GPU with more than 16 GB of VRAM.
  • An OpenAI API key (WAVJOURNEY_OPENAI_KEY) is necessary for GPT-4 access.
  • Pre-download models using python scripts/download_models.py.
  • Services can be started with bash scripts/start_services.sh, and the UI with bash scripts/start_ui.sh.
  • Command-line usage: python wavjourney_cli.py -f --input-text "..." (a minimal Python wrapper is sketched after this list).
  • Official documentation and community links are available via Discord and HuggingFace.
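
For scripted or batch use, the documented CLI call can be wrapped from Python. Only the flags and the WAVJOURNEY_OPENAI_KEY variable come from the steps above; the helper function and example prompt are illustrative.

```python
# Minimal wrapper around the documented CLI entry point. Assumes the conda
# environment is active and the services from start_services.sh are running.
import os
import subprocess

def generate_audio_story(prompt: str) -> None:
    # GPT-4 access requires the key described above.
    if not os.environ.get("WAVJOURNEY_OPENAI_KEY"):
        raise RuntimeError("Set WAVJOURNEY_OPENAI_KEY before running WavJourney.")
    subprocess.run(
        ["python", "wavjourney_cli.py", "-f", "--input-text", prompt],
        check=True,
    )

if __name__ == "__main__":
    generate_audio_story("A pirate ship sails into a thunderstorm at night.")
```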

Highlighted Details

  • Supports speaker customization with voice presets.
  • Integrates with state-of-the-art models like Bark (TTS) and AudioCraft.
  • Offers both command-line interface and a Web UI.
  • Enables programmatic control via API services (a hypothetical request is sketched after this list).
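
The service interface itself is not documented in this summary, so the endpoint, port, payload, and response handling below are placeholders rather than WavJourney's real API; they only illustrate what driving the locally started services from code might look like. Check the repository's service code for the actual routes.

```python
# Purely hypothetical example: the URL, port, and JSON fields are assumptions,
# not WavJourney's documented API. Shown only to illustrate calling the locally
# started services from code instead of using the Web UI or CLI.
import requests

resp = requests.post(
    "http://127.0.0.1:8021/generate",   # assumed local service endpoint
    json={"text": "A calm narrator describes a rainy harbor at dawn."},
    timeout=600,                        # generation can take several minutes
)
resp.raise_for_status()
with open("output.wav", "wb") as f:
    f.write(resp.content)               # assumes the service returns raw audio bytes
```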

Maintenance & Community

The project is actively seeking research and commercial cooperation. Community interaction is encouraged via Discord and HuggingFace.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The default configuration requires a Linux operating system and a GPU with more than 16 GB of VRAM. The pipeline also depends on external APIs, most notably OpenAI's GPT-4 for script generation, which can incur usage costs and adds an external dependency.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 7 stars in the last 90 days

Explore Similar Projects

Starred by Thomas Wolf (Cofounder of Hugging Face), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 2 more.

ultravox by fixie-ai

Multimodal LLM for real-time voice interactions

created 1 year ago
updated 5 days ago
4k stars
Top 0.4% on sourcepulse