Whisper-Finetune by yeyupiaoling

Whisper finetuning and inference toolkit

Created 2 years ago
1,125 stars

Top 34.2% on SourcePulse

View on GitHub
Project Summary

This repository provides tools and scripts for fine-tuning OpenAI's Whisper speech recognition model using LoRA. It supports training with or without timestamp data, and even without speech data, enabling customization for specific domains or languages. The project also offers accelerated inference options and deployment solutions for web, Windows desktop, and Android applications.

How It Works

The core of the project is fine-tuning Whisper with the LoRA (Low-Rank Adaptation) technique, which adapts a large pre-trained model by training small low-rank weight updates while the original weights stay frozen, cutting the number of trainable parameters dramatically. This makes it practical to train on diverse datasets, including ones lacking timestamp information or, for some tasks, lacking speech audio entirely. For inference, the project leverages CTranslate2 and GGML for accelerated performance, and integrates with Hugging Face's Transformers library for broader compatibility.
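As a minimal sketch of the technique (not the repository's exact training script; the checkpoint name, rank, and target modules below are illustrative assumptions), attaching LoRA adapters to Whisper via Hugging Face's PEFT library looks roughly like this:

```python
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Load a pre-trained Whisper checkpoint (any size from tiny to large-v3).
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# Attach low-rank adapters to the attention projections; only these small
# matrices are trained, while the base model's weights remain frozen.
lora_config = LoraConfig(
    r=32,                                 # rank of the low-rank update
    lora_alpha=64,                        # scaling factor for the update
    target_modules=["q_proj", "v_proj"],  # which Linear layers to adapt
    lora_dropout=0.05,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically ~1% of all parameters
```

After training, the adapter weights can be merged back into the base model so the result loads like an ordinary Whisper checkpoint.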

Quick Start & Requirements

  • Installation: pip install -r requirements.txt (or use the provided Docker image pytorch/pytorch:2.4.0-cuda11.8-cudnn9-devel).
  • Prerequisites: Python 3.11, PyTorch 2.4.0, and CUDA 11.8 are recommended; a GPU is needed for training (an A100-PCIE-40GB is used in the examples). Windows users may need a bitsandbytes build from a specific GitHub release.
  • Data Preparation: Training data must be in JSON Lines format; an aishell.py script is provided for preparing the AIShell dataset (see the manifest sketch after this list).
  • Resources: Fine-tuning requires significant GPU memory and compute. Inference acceleration options are available.
  • Documentation: Web Deployment, API Docs.
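For illustration only, a manifest line could be generated as below; the field names ("audio", "sentence", "duration", "sentences") are assumptions and should be checked against what aishell.py actually emits:

```python
import json

# Hypothetical single entry of the JSON Lines training manifest.
entry = {
    "audio": {"path": "dataset/audio/0.wav"},
    "sentence": "近几年，不但我用书给女儿压岁。",  # full transcript
    "duration": 4.96,                              # clip length in seconds
    # Optional per-segment timestamps; omit when training without them.
    "sentences": [
        {"start": 0.0, "end": 4.96, "text": "近几年，不但我用书给女儿压岁。"}
    ],
}

# One JSON object per line, UTF-8, without ASCII-escaping the Chinese text.
with open("train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```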

Highlighted Details

  • Supports fine-tuning Whisper models (tiny, base, small, medium, large, large-v2, large-v3).
  • Offers accelerated inference via CTranslate2 and GGML (see the sketch after this list).
  • Enables deployment to Web (API server), Windows desktop, and Android applications.
  • Includes performance benchmarks showing significant speedups from optimizations such as Flash Attention 2, torch.compile, and BetterTransformer.
  • Provides detailed character error rate (CER) and word error rate (WER) test tables for various models and datasets.
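As a hedged example of the CTranslate2 route, the faster-whisper package (a Python wrapper around CTranslate2) can run a converted model; the model directory below is a placeholder, and the conversion is assumed to have been done beforehand (e.g. with CTranslate2's ct2-transformers-converter after merging the LoRA weights):

```python
from faster_whisper import WhisperModel

# Load a CTranslate2-converted Whisper model with FP16 weights on GPU.
model = WhisperModel("whisper-small-ct2", device="cuda", compute_type="float16")

# Transcribe; segments are produced lazily as a generator.
segments, info = model.transcribe("audio.wav", beam_size=5)
print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```

CTranslate2 trades flexibility for speed: the model runs in an optimized C++ engine with optional FP16/INT8 quantization, which is where much of the reported speedup comes from.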

Maintenance & Community

  • Active development is implied by the inclusion of recent Whisper versions (large-v3).
  • Community discussion is encouraged via the author's Knowledge Planet (知识星球) group and a QQ group.

Licensing & Compatibility

  • The repository itself does not explicitly state a license. The underlying Whisper model is released under the MIT license.
  • Compatibility for commercial use depends on the licensing of the base Whisper model and any other dependencies.

Limitations & Caveats

  • Some model files and processed datasets are distributed only through the author's Knowledge Planet (知识星球) community, which may require payment or membership.
  • The README notes that punctuation may need to be removed during evaluation to obtain accurate error rates, suggesting inconsistent punctuation handling in fine-tuned models.

Health Check

  • Last Commit: 3 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 14 stars in the last 30 days
