This repository provides tools and scripts for fine-tuning OpenAI's Whisper speech recognition model using LoRA. It supports training with or without timestamp data, and even without speech data, enabling customization for specific domains or languages. The project also offers accelerated inference options and deployment solutions for web, Windows desktop, and Android applications.
How It Works
The core of the project involves fine-tuning Whisper using the LoRA (Low-Rank Adaptation) technique, which allows for efficient adaptation of large pre-trained models with significantly fewer trainable parameters. This approach enables training on diverse datasets, including those lacking timestamp information or even speech content for specific tasks. For inference, it leverages CTranslate2 and GGML for accelerated performance, and integrates with Hugging Face's Transformers library for broader compatibility.
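As a minimal sketch of what this looks like in practice, the following uses Hugging Face's `peft` library to wrap a Whisper checkpoint with LoRA adapters; the model name and hyperparameters (rank, alpha, target modules) are illustrative assumptions, not the repository's exact configuration:

```python
# Minimal LoRA setup sketch (illustrative hyperparameters, not
# necessarily the repository's exact configuration).
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

lora_config = LoraConfig(
    r=32,                                 # low-rank dimension
    lora_alpha=64,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    bias="none",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights train
```

Because only the low-rank adapter matrices receive gradients, the memory and compute cost of fine-tuning drops sharply compared with full-parameter training.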
Quick Start & Requirements
- Installation: `pip install -r requirements.txt` (or use the provided Docker image `pytorch/pytorch:2.4.0-cuda11.8-cudnn9-devel`).
- Prerequisites: Python 3.11, PyTorch 2.4.0, CUDA 11.8 (recommended), and a GPU (an A100-PCIE-40GB is used in the examples). Windows users may need `bitsandbytes` from a specific GitHub release.
- Data Preparation: Requires data in JSON Lines format; an `aishell.py` script is provided for processing the AIShell dataset (a manifest sketch follows this list).
- Resources: Fine-tuning requires significant GPU memory and compute. Inference acceleration options are available.
- Documentation: Web Deployment, API Docs.
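To illustrate the JSON Lines format, here is a hypothetical manifest writer; the field names (`audio`, `sentence`, `sentences`, `duration`) are an assumed schema for illustration only, so consult `aishell.py` for the exact structure the training scripts expect:

```python
import json

# Hypothetical manifest entries; the field names below are an assumed
# schema for illustration -- consult aishell.py for the exact structure.
examples = [
    {
        "audio": {"path": "dataset/audio/0.wav"},
        "sentence": "hello world",
        "duration": 2.39,
        # Optional per-segment timestamps; omit when unavailable.
        "sentences": [{"start": 0.0, "end": 2.39, "text": "hello world"}],
    }
]

with open("dataset/train.jsonl", "w", encoding="utf-8") as f:
    for example in examples:
        f.write(json.dumps(example, ensure_ascii=False) + "\n")
```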
Highlighted Details
- Supports fine-tuning Whisper models (tiny, base, small, medium, large, large-v2, large-v3).
- Offers accelerated inference via CTranslate2 and GGML (an inference sketch follows this list).
- Enables deployment to Web (API server), Windows desktop, and Android applications.
- Includes performance benchmarks showing significant speedups from various optimization techniques (Flash Attention 2, torch.compile, BetterTransformer).
- Provides detailed character error rate (CER) and word error rate (WER) test tables for various models and datasets.
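For the CTranslate2 path, a hedged sketch using the `faster-whisper` package (a common front end for CTranslate2 Whisper models); the model directory and audio path are placeholders, and the conversion command shown is CTranslate2's standard converter rather than a repository-specific tool:

```python
# Sketch: transcribing with a CTranslate2-converted Whisper model via
# faster-whisper. Paths are placeholders; convert a fine-tuned checkpoint
# first, e.g.:
#   ct2-transformers-converter --model ./whisper-finetuned \
#       --output_dir ./whisper-ct2 --quantization float16
from faster_whisper import WhisperModel

model = WhisperModel("./whisper-ct2", device="cuda", compute_type="float16")
segments, _info = model.transcribe("test.wav", beam_size=5)
for seg in segments:
    print(f"[{seg.start:.2f}s -> {seg.end:.2f}s] {seg.text}")
```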
Maintenance & Community
- Active development is implied by the inclusion of recent Whisper versions (large-v3).
- Community discussion is encouraged via a Knowledge Planet (知识星球) community and a QQ group.
Licensing & Compatibility
- The repository itself does not explicitly state a license. The underlying Whisper model is released under the MIT license.
- Compatibility for commercial use depends on the licensing of the base Whisper model and any other dependencies.
Limitations & Caveats
- Some model files and processed datasets are only available through the author's Knowledge Planet, which may require payment or membership.
- The README notes that removing punctuation during evaluation may be necessary for accurate scoring, implying potential issues with punctuation handling in fine-tuned models (a minimal scoring sketch follows).
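A minimal sketch of punctuation-insensitive CER scoring; the use of Hugging Face's `evaluate` library and the normalization regex are assumptions for illustration, not the repository's own evaluation code:

```python
# Sketch: strip punctuation (and spacing) before computing CER so that
# punctuation differences do not inflate the error rate. The `evaluate`
# library is an assumed choice here, not the repo's own script.
import re
import evaluate

cer = evaluate.load("cer")

def normalize(text: str) -> str:
    # Keep only word characters (covers CJK); drop punctuation and spaces.
    return re.sub(r"[^\w]", "", text, flags=re.UNICODE)

references = ["你好，世界。"]
predictions = ["你好 世界"]
score = cer.compute(
    predictions=[normalize(p) for p in predictions],
    references=[normalize(r) for r in references],
)
print(f"CER without punctuation: {score:.4f}")  # 0.0 for this toy pair
```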