WhisperJAV by meizhong986

Advanced ASR for specialized Japanese audio

Created 2 years ago
814 stars

Top 43.5% on SourcePulse

Project Summary

A subtitle generation tool specifically designed to overcome the significant performance degradation of standard ASR models like Whisper when applied to Japanese Adult Videos (JAV). It addresses the unique challenges of JAV audio, including low SNR, non-verbal vocalizations, spectral mimicry, linguistic variance, and temporal drift, offering improved accuracy and reduced hallucinations for this niche domain. The project targets users requiring high-quality subtitles for JAV content and researchers interested in ASR for noisy, specialized audio.

How It Works

WhisperJAV employs a multi-stage inference pipeline that targets JAV's specific acoustic and linguistic characteristics:

  • Acoustic Filtering: scene-based segmentation plus Voice Activity Detection (VAD) clamping, so the model decodes coherent audio segments rather than arbitrary chunks.
  • Linguistic Adaptation: normalizes domain-specific terminology and corrects dialect-induced tokenization errors.
  • Defensive Decoding: tunes log-probability thresholds and applies regex filters to systematically discard low-confidence outputs and non-lexical markers, mitigating hallucinations.
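The defensive-decoding idea can be sketched as a two-stage filter: drop segments whose average log-probability falls below a threshold, then drop text matching known hallucination patterns. This is an illustrative sketch, not WhisperJAV's actual code; the function name, pattern list, and threshold value are assumptions.

```python
import re

# Hypothetical hallucination patterns: non-lexical markers and a
# well-known filler line Whisper emits on low-SNR Japanese audio.
HALLUCINATION_PATTERNS = [
    re.compile(r"^[♪♫\s]*$"),                 # music notes / empty markers
    re.compile(r"ご視聴ありがとうございました"),  # "thanks for watching" hallucination
]

def keep_segment(text: str, avg_logprob: float, threshold: float = -1.0) -> bool:
    """Return True if a decoded segment survives the defensive filters."""
    if avg_logprob < threshold:               # discard low-confidence decodes
        return False
    return not any(p.search(text) for p in HALLUCINATION_PATTERNS)
```

In a real pipeline the threshold would differ per sensitivity preset, and the pattern list would be far longer.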

Quick Start & Requirements

  • Installation: Easiest via Windows Installer (WhisperJAV-1.7.4-Windows-x86_64.exe). Alternatively, install from source using provided scripts (install_windows.bat, install_linux.sh, install.py) which auto-detect GPUs and CUDA versions. Manual pip installation is also supported.
  • Prerequisites: Python 3.9-3.12 (3.13+ incompatible), FFmpeg in system PATH, GPU recommended (NVIDIA CUDA, Apple MPS, AMD ROCm), 8GB+ disk space. Windows users require specific NVIDIA drivers, CUDA Toolkit, and cuDNN versions.
  • Resource Footprint: GPU processing is estimated at 5-10 minutes per hour of video (NVIDIA), significantly longer for CPU-only. Model downloads can exceed 3GB.
  • Links: Windows Installer, Source Installation Scripts, Python Downloads, FFmpeg Builds.
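The prerequisites above (Python 3.9-3.12, FFmpeg on PATH) can be checked up front before attempting an install. A minimal sketch; the function name is hypothetical and not part of WhisperJAV's install scripts, which perform their own detection.

```python
import shutil
import sys
from typing import List, Optional, Tuple

def environment_problems(py_version: Tuple[int, int],
                         ffmpeg_path: Optional[str]) -> List[str]:
    """Collect human-readable problems with the local environment."""
    problems = []
    major, minor = py_version
    if not (major == 3 and 9 <= minor <= 12):
        problems.append(f"Python {major}.{minor} is unsupported; need 3.9-3.12")
    if ffmpeg_path is None:
        problems.append("FFmpeg was not found on PATH")
    return problems

# In practice you would call:
#   environment_problems(sys.version_info[:2], shutil.which("ffmpeg"))
```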

Highlighted Details

  • Processing Modes: Offers faster, fast, balanced (default), fidelity, and transformers (utilizing a Japanese-optimized model).
  • Sensitivity Settings: conservative, balanced, aggressive to control hallucination thresholds.
  • Two-Pass Ensemble: Combines results from two different pipelines (e.g., transformers + balanced) for potentially enhanced accuracy.
  • AI Translation: Integrated subtitle translation capabilities supporting multiple providers (DeepSeek, Gemini, Claude, GPT-4, OpenRouter) with resume functionality.
  • Scene Detection: Supports auditok (default), silero, and semantic methods.
  • Japanese Post-Processing: Handles specific linguistic features like particles, aizuchi, dialects, and filters common Whisper hallucinations.
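Conceptually, the sensitivity settings trade recall against hallucination risk by shifting decoder thresholds. The mapping below is purely illustrative: the preset names come from the README, but the numeric threshold values are assumptions, not WhisperJAV's actual configuration.

```python
# Hypothetical preset-to-threshold mapping. "conservative" rejects more
# low-confidence output (fewer hallucinations, more dropped speech);
# "aggressive" keeps more of the decode.
SENSITIVITY_PRESETS = {
    "conservative": {"logprob_threshold": -0.5, "no_speech_threshold": 0.4},
    "balanced":     {"logprob_threshold": -1.0, "no_speech_threshold": 0.6},
    "aggressive":   {"logprob_threshold": -1.5, "no_speech_threshold": 0.8},
}

def thresholds_for(preset: str) -> dict:
    """Look up decoder thresholds for a sensitivity preset."""
    try:
        return SENSITIVITY_PRESETS[preset]
    except KeyError:
        raise ValueError(
            f"unknown preset {preset!r}; choose one of {sorted(SENSITIVITY_PRESETS)}"
        )
```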

Maintenance & Community

No specific details regarding maintainers, community channels (e.g., Discord/Slack), or sponsorship were found in the provided README.

Licensing & Compatibility

  • License: MIT License.
  • Compatibility: The MIT license generally permits commercial use and integration into closed-source projects. Users are responsible for legal compliance regarding content processing.

Limitations & Caveats

Python versions 3.13 and above are incompatible. AMD GPU (ROCm) support is experimental, and CPU-only processing is notably slow. The tool generates subtitles for accessibility, and users bear responsibility for adhering to relevant laws.

Health Check

  • Last Commit: 1 day ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 44
  • Star History: 710 stars in the last 30 days
