WavJourney by Audio-AGI

Audio creation pipeline using LLMs for compositional generation

created 2 years ago
539 stars

Top 59.7% on sourcepulse

Project Summary

WavJourney enables compositional audio creation from text prompts, targeting multimedia storytellers and content creators. It generates integrated audio experiences featuring custom speakers, contextual speech, music, and sound effects, aiming to enhance auditory storytelling.

How It Works

WavJourney leverages Large Language Models (LLMs) to orchestrate a pipeline of specialized audio generation models. It breaks down a text prompt into a structured script, assigning roles and emotional cues. This script then drives separate Text-to-Speech (TTS), music generation, and sound effect models, composing them into a coherent audio narrative. This compositional approach allows for fine-grained control and contextually relevant audio elements.
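
The README does not reproduce the script schema here, so the sketch below is only an illustration of the compositional idea: it assumes a made-up script format of typed entries that are dispatched to Bark (TTS) and AudioCraft's MusicGen/AudioGen models, backends the project integrates. It is not WavJourney's actual pipeline code.

```python
# Illustrative sketch only: an LLM-produced "script" modeled as a list of typed
# entries, each dispatched to a dedicated generator. The schema is a hypothetical
# stand-in for the structured script WavJourney derives from the text prompt.
from bark import SAMPLE_RATE, generate_audio          # TTS backend used by WavJourney
from audiocraft.models import MusicGen, AudioGen      # music / sound-effect backends

script = [
    {"type": "speech", "text": "Welcome to the midnight forest.",
     "voice": "v2/en_speaker_6"},                     # Bark voice preset
    {"type": "sfx",   "text": "owls hooting in the distance", "duration": 4},
    {"type": "music", "text": "gentle ambient pads, mysterious mood", "duration": 8},
]

music_model = MusicGen.get_pretrained("facebook/musicgen-small")
sfx_model = AudioGen.get_pretrained("facebook/audiogen-medium")

segments = []  # (waveform, sample_rate) pairs, in script order
for entry in script:
    if entry["type"] == "speech":
        wav = generate_audio(entry["text"], history_prompt=entry["voice"])
        segments.append((wav, SAMPLE_RATE))
    elif entry["type"] == "music":
        music_model.set_generation_params(duration=entry["duration"])
        wav = music_model.generate([entry["text"]])[0].cpu().numpy().squeeze()
        segments.append((wav, music_model.sample_rate))
    elif entry["type"] == "sfx":
        sfx_model.set_generation_params(duration=entry["duration"])
        wav = sfx_model.generate([entry["text"]])[0].cpu().numpy().squeeze()
        segments.append((wav, sfx_model.sample_rate))

# A real composer would resample to a common rate, mix overlapping tracks, and
# apply the timing and volume cues from the script; segments stay separate here.
```

One model call per script entry is what makes the fine-grained control possible: a speech line, a music cue, or a sound effect can be regenerated independently without redoing the whole piece.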

Quick Start & Requirements

  • Install via bash ./scripts/EnvsSetup.sh and activate with conda activate WavJourney.
  • Requires a Linux OS and a GPU with more than 16 GB of VRAM.
  • An OpenAI API key (WAVJOURNEY_OPENAI_KEY) is necessary for GPT-4 access.
  • Pre-download models using python scripts/download_models.py.
  • Services can be started with bash scripts/start_services.sh, and the UI with bash scripts/start_ui.sh.
  • Command-line usage: python wavjourney_cli.py -f --input-text "..." (a minimal Python wrapper is sketched after this list).
  • Official documentation and community links are available via Discord and HuggingFace.
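
For scripted or batch use, the documented CLI call can be wrapped from Python. Only the flags and the WAVJOURNEY_OPENAI_KEY variable come from the steps above; the helper function and example prompt are illustrative.

```python
# Minimal wrapper around the documented CLI entry point. Assumes the conda
# environment is active and the services from start_services.sh are running.
import os
import subprocess

def generate_audio_story(prompt: str) -> None:
    # GPT-4 access requires the key described above.
    if not os.environ.get("WAVJOURNEY_OPENAI_KEY"):
        raise RuntimeError("Set WAVJOURNEY_OPENAI_KEY before running WavJourney.")
    subprocess.run(
        ["python", "wavjourney_cli.py", "-f", "--input-text", prompt],
        check=True,
    )

if __name__ == "__main__":
    generate_audio_story("A pirate ship sails into a thunderstorm at night.")
```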

Highlighted Details

  • Supports speaker customization with voice presets.
  • Integrates with state-of-the-art models like Bark (TTS) and AudioCraft.
  • Offers both command-line interface and a Web UI.
  • Enables programmatic control via API services (a hypothetical request is sketched after this list).
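
The service interface itself is not documented in this summary, so the endpoint, port, payload, and response handling below are placeholders rather than WavJourney's real API; they only illustrate what driving the locally started services from code might look like. Check the repository's service code for the actual routes.

```python
# Purely hypothetical example: the URL, port, and JSON fields are assumptions,
# not WavJourney's documented API. Shown only to illustrate calling the locally
# started services from code instead of using the Web UI or CLI.
import requests

resp = requests.post(
    "http://127.0.0.1:8021/generate",   # assumed local service endpoint
    json={"text": "A calm narrator describes a rainy harbor at dawn."},
    timeout=600,                        # generation can take several minutes
)
resp.raise_for_status()
with open("output.wav", "wb") as f:
    f.write(resp.content)               # assumes the service returns raw audio bytes
```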

Maintenance & Community

The project is actively seeking research and commercial cooperation. Community interaction is encouraged via Discord and HuggingFace.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The default configuration requires a Linux operating system and a GPU with more than 16 GB of VRAM. The pipeline also depends on external APIs, most notably OpenAI's GPT-4 for script generation, which can incur usage costs and adds an external dependency.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 7 stars in the last 90 days

Explore Similar Projects

Starred by Thomas Wolf (Cofounder of Hugging Face), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 2 more.

ultravox by fixie-ai

Multimodal LLM for real-time voice interactions

created 1 year ago
updated 5 days ago
4k stars
Top 0.4% on sourcepulse