zamia-speech by gooofy

Speech tools/data for cloudless ASR, plus TTS voice training

Created 9 years ago

447 stars

Top 67.1% on SourcePulse

Project Summary

This repository provides open-source tools and data for building cloudless Automatic Speech Recognition (ASR) systems. It targets developers and researchers in Natural Language Processing (NLP) who need to create custom speech models, offering scripts for data processing, model training, and integration with Kaldi and wav2letter++.

How It Works

The project uses open-source speech and text corpora (e.g., VoxForge, LibriSpeech, Common Voice, Europarl) to train ASR models. It includes Python scripts for data cleaning, format conversion, noise augmentation, and language model generation with KenLM. The supported ASR model types are Kaldi's nnet3 chain models and wav2letter++. Grapheme-to-phoneme (G2P) conversion is handled with Sequitur, and existing models can be adapted to new domains.
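
As an illustration of the text-cleaning step that precedes language model generation, here is a minimal sketch in Python 3. It is not the project's own code (the project's scripts target Python 2.7): the function names and the normalization rules (lowercasing, dropping digits and punctuation) are assumptions chosen for brevity, and a real pipeline would more likely spell numbers out than drop them. The last helper only builds a KenLM `lmplz` command line and does not run it.

```python
import re
import unicodedata

# Letters only (any script), optionally joined by inner apostrophes: "it's", "straße".
_TOKEN_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*", re.UNICODE)


def normalize_sentence(line):
    """Lowercase a line and keep only word-like tokens."""
    text = unicodedata.normalize("NFC", line).lower()
    return " ".join(_TOKEN_RE.findall(text))


def normalize_corpus(lines):
    """Yield one cleaned sentence per input line, skipping lines that end up empty."""
    for line in lines:
        cleaned = normalize_sentence(line)
        if cleaned:
            yield cleaned


def lmplz_command(order, text_path, arpa_path):
    """Build (but do not execute) a KenLM lmplz invocation for an n-gram model."""
    return ["lmplz", "-o", str(order), "--text", text_path, "--arpa", arpa_path]


if __name__ == "__main__":
    for sentence in normalize_corpus(["Hello, World! It's 2018.", " ... "]):
        print(sentence)
    print(" ".join(lmplz_command(4, "corpus.txt", "lm.arpa")))
```

The cleaned output has one sentence per line, which is the input layout `lmplz` reads; the resulting command list could be passed to `subprocess.run` once KenLM is installed.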

Quick Start & Requirements

Install: The tooling consists mainly of Python scripts. Dependencies include Python 2.7 (with nltk, numpy, cython), KenLM, Kaldi, wav2letter++, sox, ffmpeg, and optionally CUDA for GPU acceleration. Binary packages are available for Debian/Raspbian and CentOS.
Setup: Requires large data downloads (speech and text corpora) and configuration via ~/.speechrc. Expect setup to take considerable time because of data acquisition and model training.
Demo: Pre-trained Kaldi models for English and German are available for direct use via provided Python scripts (kaldi_decode_wav.py, kaldi_decode_live.py) and a Docker image.
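
To make the ~/.speechrc configuration step more concrete, the following sketch reads such a file with Python 3's standard `configparser`. It assumes the file is INI-style with a `[speech]` section; the key names `speech_corpora` and `kaldi_root`, the sample paths, and both function names are illustrative assumptions, so check the project's README for the keys the scripts actually expect.

```python
import configparser
import os

# Hypothetical keys treated as mandatory for this sketch.
REQUIRED_KEYS = ("speech_corpora", "kaldi_root")


def parse_speechrc(text, section="speech"):
    """Parse INI-style config text and return the chosen section as a dict."""
    # Interpolation is disabled so that '%' characters in paths are kept literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    if not parser.has_section(section):
        raise ValueError("missing [%s] section" % section)
    settings = dict(parser.items(section))
    missing = [key for key in REQUIRED_KEYS if key not in settings]
    if missing:
        raise ValueError("missing keys: " + ", ".join(missing))
    return settings


def load_speechrc(path="~/.speechrc"):
    """Read the config file from disk, expanding '~' to the home directory."""
    with open(os.path.expanduser(path), encoding="utf-8") as handle:
        return parse_speechrc(handle.read())


if __name__ == "__main__":
    sample = "[speech]\nspeech_corpora = /data/corpora\nkaldi_root = /opt/kaldi\n"
    print(parse_speechrc(sample))
```

Validating the file up front gives a clear error before any long-running download or training step starts.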

Highlighted Details

Supports training of Kaldi GMM and nnet3 chain (TDNN) models, as well as wav2letter++ models.
Includes tools for G2P conversion, lexicon management (IPA format), and language model building.
Provides scripts for audiobook segmentation and transcription, both manual and semi-automatic (Kaldi-based).
Offers model adaptation capabilities for domain-specific grammars (JSGF) and language models.
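
For readers unfamiliar with JSGF, the grammar format mentioned for model adaptation, the sketch below generates a small grammar from a list of phrases in Python 3. The function name, grammar name, and example phrases are hypothetical, and the sketch assumes the phrases are plain words that need no JSGF quoting or escaping. It only produces the grammar text and does not call any of the project's adaptation scripts.

```python
import re

# Conservative subset of the characters allowed in grammar and rule names.
_NAME_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_jsgf(grammar_name, rule_name, phrases):
    """Return JSGF text with one public rule listing the phrases as alternatives."""
    for name in (grammar_name, rule_name):
        if not _NAME_RE.match(name):
            raise ValueError("invalid JSGF name: %r" % name)
    # Collapse runs of whitespace and drop empty phrases.
    cleaned = [" ".join(phrase.split()) for phrase in phrases if phrase.strip()]
    if not cleaned:
        raise ValueError("at least one non-empty phrase is required")
    alternatives = " | ".join(cleaned)
    return "#JSGF V1.0;\n\ngrammar {0};\n\npublic <{1}> = {2};\n".format(
        grammar_name, rule_name, alternatives
    )


if __name__ == "__main__":
    print(build_jsgf("lights", "command", ["turn on the light", "turn off the light"]))
```

A grammar like this restricts recognition to a handful of commands, which is the kind of domain-specific constraint the adaptation feature is described as supporting.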

Maintenance & Community

The project is maintained by Guenter Bartsch and Marc Puels, with contributions from Paul Guyot. There are no explicit links to community forums (Discord/Slack) or a public roadmap in the README.

Licensing & Compatibility

The project's own scripts and data are licensed under LGPLv3. The README notes that some scripts and files are based on other works, so users should check the copyright headers for file-specific licensing. LGPLv3 generally permits commercial use and linking with closed-source applications, provided its conditions are met.

Limitations & Caveats

The README explicitly states that the scripts do not form a complete end-user application and are intended primarily for developers. Setup requires considerable effort in data collection and configuration. Some components are marked experimental (e.g., Zamia-TTS). The last update appears to date from around 2018, so dependencies may be outdated (the scripts target Python 2.7) and the project may no longer be actively maintained.

Explore Similar Projects

speech-recognition-uk by egorsmkv

reverb by revdotcom

awesome-large-audio-models by EmulationAI

Speech-to-Text-Russian by SergeyShk

VITA-Audio by VITA-MLLM

dataspeech by huggingface

speech_course by yandexdataschool

athena by athena-team

alltalk_tts by erew123

icefall by k2-fsa

SenseVoice by FunAudioLLM

PaddleSpeech by PaddlePaddle