whisper-finetune by vasistalodagala

Whisper fine-tuning scripts for ASR tasks

created 2 years ago
322 stars

Top 85.5% on sourcepulse

Project Summary

This repository provides scripts for fine-tuning and evaluating OpenAI's Whisper models for Automatic Speech Recognition (ASR) on custom or Hugging Face datasets. It targets researchers and developers looking to adapt Whisper for specific languages, accents, or noisy audio conditions, enabling improved ASR performance.

How It Works

The project leverages the Hugging Face Transformers library to load and fine-tune the various Whisper model configurations. It supports both Hugging Face datasets and custom datasets, the latter prepared in a simple two-file format (audio_paths and text). The scripts support distributed training across multiple GPUs and offer guidance on hyperparameter tuning, including learning-rate recommendations based on model size. An alternative, faster evaluation path using whisper-jax is also provided for improved inference speed.
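
As a rough illustration of that two-file layout, the sketch below writes an audio_paths file and a text file from a small in-memory list. The exact per-line format (assumed here to be "<utterance-id> <value>") should be confirmed against the repository's README; the ids and paths are placeholders.

```python
# Hedged sketch of preparing the custom dataset's two manifest files.
# Assumption: one "<utterance-id> <value>" pair per line in each file.
examples = [
    ("utt_0001", "/data/audio/utt_0001.wav", "hello world"),
    ("utt_0002", "/data/audio/utt_0002.wav", "fine tuning whisper"),
]

with open("audio_paths", "w", encoding="utf-8") as paths_f, \
     open("text", "w", encoding="utf-8") as text_f:
    for utt_id, wav_path, transcript in examples:
        paths_f.write(f"{utt_id} {wav_path}\n")   # utterance id + absolute audio path
        text_f.write(f"{utt_id} {transcript}\n")  # utterance id + ground-truth transcription
```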

Quick Start & Requirements

  • Install: Clone the repository and install dependencies using pip install -r requirements.txt within a Python 3.8 virtual environment.
  • Prerequisites: CUDA 11.3 is recommended. git-lfs is required for pushing models to Hugging Face.
  • Setup: No configuration is needed beyond creating the virtual environment and installing the dependencies.
  • Docs: Detailed usage examples for fine-tuning, evaluation, and transcription are provided in the README; a hedged transcription sketch follows this list.
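
Once a checkpoint has been fine-tuned, it can be loaded like any other Whisper model through Transformers. The sketch below is a minimal illustration of that step; the model id and audio file are placeholders, and the repository's own transcription script should be preferred for exact behaviour.

```python
# Minimal sketch: transcribe one clip with a (fine-tuned) Whisper checkpoint
# via Hugging Face Transformers. The model id below is a placeholder.
import torch
import soundfile as sf
from transformers import WhisperProcessor, WhisperForConditionalGeneration

model_id = "your-username/whisper-small-finetuned"  # hypothetical checkpoint
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)

audio, sr = sf.read("sample.wav")  # assumed 16 kHz mono audio
inputs = processor(audio, sampling_rate=sr, return_tensors="pt")

with torch.no_grad():
    generated_ids = model.generate(inputs.input_features)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```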

Highlighted Details

  • Supports fine-tuning on both Hugging Face datasets and custom audio data.
  • Includes scripts for extracting encoder and decoder embeddings for downstream tasks.
  • Offers faster inference and evaluation via whisper-jax integration (see the sketch after this list).
  • Provides guidance on hyperparameter tuning, particularly learning rates for fine-tuning.
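
A hedged sketch of the whisper-jax path mentioned above, using the upstream whisper-jax pipeline API rather than this repository's own scripts; "openai/whisper-small" and "sample.wav" are placeholders, and the checkpoint must ship Flax weights.

```python
# Sketch of faster evaluation via whisper-jax (not this repo's own code).
# Requires a checkpoint with Flax weights available on Hugging Face.
import jax.numpy as jnp
from whisper_jax import FlaxWhisperPipline  # class name is spelled this way upstream

pipeline = FlaxWhisperPipline("openai/whisper-small", dtype=jnp.bfloat16)

# The first call JIT-compiles the forward pass and is slow; later calls are fast.
result = pipeline("sample.wav", task="transcribe")
print(result)  # dict containing the transcribed text
```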

Maintenance & Community

The repository is maintained by vasistalodagala. The README does not mention community channels or a roadmap.

Licensing & Compatibility

The repository's license is not explicitly stated in the README, so reuse terms for the scripts themselves are unclear. Its core dependency, the Hugging Face Transformers library, is released under the Apache 2.0 license, which is generally compatible with commercial use.

Limitations & Caveats

Audio segments processed for embedding extraction should not exceed 30 seconds due to Whisper's positional embedding limitations. The whisper-jax integration requires models to have available Flax weights on Hugging Face.
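
A simple way to respect the 30-second limit is to chunk long recordings before extracting embeddings. The helper below is a hypothetical illustration (not part of the repository) using the soundfile package.

```python
# Hypothetical helper: split audio into <= 30 s chunks before embedding
# extraction, since Whisper's positional embeddings cover at most 30 seconds.
import soundfile as sf

MAX_SECONDS = 30

def split_into_chunks(path, max_seconds=MAX_SECONDS):
    audio, sr = sf.read(path)
    chunk_len = int(max_seconds * sr)
    chunks = [audio[i:i + chunk_len] for i in range(0, len(audio), chunk_len)]
    return chunks, sr

chunks, sr = split_into_chunks("long_recording.wav")
print(f"{len(chunks)} chunks of at most {MAX_SECONDS} s each at {sr} Hz")
```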

Health Check

  • Last commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 23 stars in the last 90 days
