Easy Voice Toolkit is a user-friendly, locally deployable AI audio processing suite for voice recognition, transcription, and conversion. It targets users who need an integrated workflow for audio manipulation, from raw files to speech models, with a focus on voice conversion.
How It Works
The toolkit integrates several open-source projects, including audio-slicer, VoiceprintRecognition, whisper, and GPT-SoVITS. Its design is modular: each tool can be used on its own or chained with the others into a complete voice conversion workflow, so raw audio files are turned into speech models step by step.
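The chaining idea can be sketched in a few lines of Python. This is only an illustration of the modular design, not the toolkit's actual API: the stage functions (`slice_audio`, `filter_speaker`, `transcribe`), the `run_pipeline` helper, and the state dictionary are all hypothetical names, and each stage is a stand-in that manipulates strings rather than real audio.

```python
from typing import Any, Callable, Dict, List

# A stage takes the working state and returns an updated copy of it.
State = Dict[str, Any]
Stage = Callable[[State], State]


def slice_audio(state: State) -> State:
    # Stand-in for an audio-slicer style step: pretend each file yields two clips.
    clips = [f"{name}#clip{i}" for name in state["raw_files"] for i in range(2)]
    return {**state, "clips": clips}


def filter_speaker(state: State) -> State:
    # Stand-in for voiceprint filtering: keep only clips "matching" the target speaker.
    kept = [clip for clip in state["clips"] if clip.endswith("clip0")]
    return {**state, "clips": kept}


def transcribe(state: State) -> State:
    # Stand-in for whisper-style transcription: attach placeholder text to each clip.
    transcripts = {clip: f"<text of {clip}>" for clip in state["clips"]}
    return {**state, "transcripts": transcripts}


def run_pipeline(stages: List[Stage], state: State) -> State:
    # Run the selected stages in order; any subset can be chosen.
    for stage in stages:
        state = stage(state)
    return state


# Full chain, from raw files to a transcribed dataset.
result = run_pipeline(
    [slice_audio, filter_speaker, transcribe],
    {"raw_files": ["a.wav", "b.wav"]},
)
print(result["clips"])
```

Because every stage has the same shape, a single tool can be run alone (for example `run_pipeline([slice_audio], ...)`) or combined with others, which mirrors the pick-or-chain choice described above.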
Quick Start & Requirements
- Installation: Clone the repository with submodules (`git clone --recurse-submodules`) and install dependencies using `pip install -r requirements.txt`.
- Prerequisites: Python 3.8+ and PyTorch (the project provides an installation example for CUDA 11.8).
- System: Currently supports Windows only.
- Resources: A "Ready-to-use portable package" is available for easier setup, containing all dependencies and models.
- Demo: Google Colab demo available.
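Before installing, the prerequisites listed above can be checked with a short script. This is a minimal sketch and not part of the toolkit: the helper names `check_environment` and `cuda_status` are made up for illustration, and the check assumes only what this section states (Python 3.8+, Windows, PyTorch with optional CUDA).

```python
import platform
import sys


def check_environment(version_info=sys.version_info, system=None):
    """Return a list of problems found; an empty list means the basics look fine."""
    system = system or platform.system()
    problems = []
    if tuple(version_info[:2]) < (3, 8):
        problems.append("Python 3.8 or newer is required")
    if system != "Windows":
        problems.append(f"only Windows is currently supported (detected {system})")
    return problems


def cuda_status():
    """Report whether PyTorch is importable and whether it can see a CUDA device."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    if torch.cuda.is_available():
        return "PyTorch is installed and CUDA is available"
    return "PyTorch is installed, but CUDA is not available"


if __name__ == "__main__":
    for problem in check_environment():
        print("Problem:", problem)
    print(cuda_status())
```

The parameters of `check_environment` default to the running interpreter and operating system; passing explicit values is only there to make the function easy to try out.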
Highlighted Details
- Supports Chinese, English, and Japanese for most functions.
- Includes tools for audio processing, voice recognition, transcription, dataset creation, model training, and voice conversion.
- Offers both a lightweight installer and a large, ready-to-use portable package.
- Future features include LLM integration and a C++ (Qt) client refactor.
Maintenance & Community
- Backend development is marked "WIP", which indicates the project is under active development.
- Contact details provided for feedback and suggestions.
Licensing & Compatibility
- The project is free and open-source ("Natürlich~♪").
- Users are responsible for dataset authorization.
- Distribution or public sharing requires attribution to the original author and source.
- Not intended for production environments.
Limitations & Caveats
- Currently limited to Windows OS.
- Users must manage dataset authorization and are solely responsible for any infringement issues.
- Distribution terms require clearly indicating that voice changing was used and giving details of the input source.