whisper-finetune by vasistalodagala

Whisper fine-tuning scripts for ASR tasks

created 2 years ago
322 stars

Top 85.5% on sourcepulse

Project Summary

This repository provides scripts for fine-tuning and evaluating OpenAI's Whisper models for Automatic Speech Recognition (ASR) on custom or Hugging Face datasets. It targets researchers and developers looking to adapt Whisper for specific languages, accents, or noisy audio conditions, enabling improved ASR performance.

How It Works

The project leverages the Hugging Face Transformers library to load and fine-tune the various Whisper model configurations. It supports both Hugging Face datasets and custom datasets, the latter prepared in a simple two-file format (audio_paths and text). The scripts support distributed training across multiple GPUs and offer guidance on hyperparameter tuning, including learning-rate recommendations based on model size. An alternative, faster evaluation path using whisper-jax is also provided for improved inference speed.
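
As a rough illustration of that two-file layout, the sketch below writes an audio_paths file and a text file from a small in-memory list. The exact per-line format (assumed here to be "<utterance-id> <value>") should be confirmed against the repository's README; the ids and paths are placeholders.

```python
# Hedged sketch of preparing the custom dataset's two manifest files.
# Assumption: one "<utterance-id> <value>" pair per line in each file.
examples = [
    ("utt_0001", "/data/audio/utt_0001.wav", "hello world"),
    ("utt_0002", "/data/audio/utt_0002.wav", "fine tuning whisper"),
]

with open("audio_paths", "w", encoding="utf-8") as paths_f, \
     open("text", "w", encoding="utf-8") as text_f:
    for utt_id, wav_path, transcript in examples:
        paths_f.write(f"{utt_id} {wav_path}\n")   # utterance id + absolute audio path
        text_f.write(f"{utt_id} {transcript}\n")  # utterance id + ground-truth transcription
```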

Quick Start & Requirements

  • Install: Clone the repository and install dependencies using pip install -r requirements.txt within a Python 3.8 virtual environment.
  • Prerequisites: CUDA 11.3 is recommended. git-lfs is required for pushing models to Hugging Face.
  • Setup: No configuration is needed beyond creating the virtual environment and installing the dependencies.
  • Docs: Detailed usage examples for fine-tuning, evaluation, and transcription are provided in the README; a hedged transcription sketch follows this list.
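
Once a checkpoint has been fine-tuned, it can be loaded like any other Whisper model through Transformers. The sketch below is a minimal illustration of that step; the model id and audio file are placeholders, and the repository's own transcription script should be preferred for exact behaviour.

```python
# Minimal sketch: transcribe one clip with a (fine-tuned) Whisper checkpoint
# via Hugging Face Transformers. The model id below is a placeholder.
import torch
import soundfile as sf
from transformers import WhisperProcessor, WhisperForConditionalGeneration

model_id = "your-username/whisper-small-finetuned"  # hypothetical checkpoint
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)

audio, sr = sf.read("sample.wav")  # assumed 16 kHz mono audio
inputs = processor(audio, sampling_rate=sr, return_tensors="pt")

with torch.no_grad():
    generated_ids = model.generate(inputs.input_features)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```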

Highlighted Details

  • Supports fine-tuning on both Hugging Face datasets and custom audio data.
  • Includes scripts for extracting encoder and decoder embeddings for downstream tasks.
  • Offers faster inference and evaluation via whisper-jax integration (see the sketch after this list).
  • Provides guidance on hyperparameter tuning, particularly learning rates for fine-tuning.
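
A hedged sketch of the whisper-jax path mentioned above, using the upstream whisper-jax pipeline API rather than this repository's own scripts; "openai/whisper-small" and "sample.wav" are placeholders, and the checkpoint must ship Flax weights.

```python
# Sketch of faster evaluation via whisper-jax (not this repo's own code).
# Requires a checkpoint with Flax weights available on Hugging Face.
import jax.numpy as jnp
from whisper_jax import FlaxWhisperPipline  # class name is spelled this way upstream

pipeline = FlaxWhisperPipline("openai/whisper-small", dtype=jnp.bfloat16)

# The first call JIT-compiles the forward pass and is slow; later calls are fast.
result = pipeline("sample.wav", task="transcribe")
print(result)  # dict containing the transcribed text
```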

Maintenance & Community

The repository is maintained by vasistalodagala. The README does not mention community channels or a roadmap.

Licensing & Compatibility

The repository's license is not explicitly stated in the README, so reuse terms for the scripts themselves are unclear. Its core dependency, the Hugging Face Transformers library, is released under the Apache 2.0 license, which is generally compatible with commercial use.

Limitations & Caveats

Audio segments processed for embedding extraction should not exceed 30 seconds due to Whisper's positional embedding limitations. The whisper-jax integration requires models to have available Flax weights on Hugging Face.
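
A simple way to respect the 30-second limit is to chunk long recordings before extracting embeddings. The helper below is a hypothetical illustration (not part of the repository) using the soundfile package.

```python
# Hypothetical helper: split audio into <= 30 s chunks before embedding
# extraction, since Whisper's positional embeddings cover at most 30 seconds.
import soundfile as sf

MAX_SECONDS = 30

def split_into_chunks(path, max_seconds=MAX_SECONDS):
    audio, sr = sf.read(path)
    chunk_len = int(max_seconds * sr)
    chunks = [audio[i:i + chunk_len] for i in range(0, len(audio), chunk_len)]
    return chunks, sr

chunks, sr = split_into_chunks("long_recording.wav")
print(f"{len(chunks)} chunks of at most {MAX_SECONDS} s each at {sr} Hz")
```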

Health Check

  • Last commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 23 stars in the last 90 days
