WhisperJAV by meizhong986

Advanced ASR for specialized Japanese audio

Created 2 years ago
814 stars

Top 43.5% on SourcePulse

Project Summary

A subtitle generation tool specifically designed to overcome the significant performance degradation of standard ASR models like Whisper when applied to Japanese Adult Videos (JAV). It addresses the unique challenges of JAV audio, including low SNR, non-verbal vocalizations, spectral mimicry, linguistic variance, and temporal drift, offering improved accuracy and reduced hallucinations for this niche domain. The project targets users requiring high-quality subtitles for JAV content and researchers interested in ASR for noisy, specialized audio.

How It Works

WhisperJAV employs a multi-stage inference pipeline that targets JAV's specific acoustic and linguistic characteristics:

  • Acoustic Filtering: scene-based segmentation plus Voice Activity Detection (VAD) clamping, so the model decodes coherent audio segments rather than arbitrary chunks.
  • Linguistic Adaptation: normalizes domain-specific terminology and corrects dialect-induced tokenization errors.
  • Defensive Decoding: tunes log-probability thresholds and applies regex filters to systematically discard low-confidence outputs and non-lexical markers, mitigating hallucinations.
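The defensive-decoding idea can be sketched as a two-stage filter: drop segments whose average log-probability falls below a threshold, then drop text matching known hallucination patterns. This is an illustrative sketch, not WhisperJAV's actual code; the function name, pattern list, and threshold value are assumptions.

```python
import re

# Hypothetical hallucination patterns: non-lexical markers and a
# well-known filler line Whisper emits on low-SNR Japanese audio.
HALLUCINATION_PATTERNS = [
    re.compile(r"^[♪♫\s]*$"),                 # music notes / empty markers
    re.compile(r"ご視聴ありがとうございました"),  # "thanks for watching" hallucination
]

def keep_segment(text: str, avg_logprob: float, threshold: float = -1.0) -> bool:
    """Return True if a decoded segment survives the defensive filters."""
    if avg_logprob < threshold:               # discard low-confidence decodes
        return False
    return not any(p.search(text) for p in HALLUCINATION_PATTERNS)
```

In a real pipeline the threshold would differ per sensitivity preset, and the pattern list would be far longer.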

Quick Start & Requirements

  • Installation: Easiest via Windows Installer (WhisperJAV-1.7.4-Windows-x86_64.exe). Alternatively, install from source using provided scripts (install_windows.bat, install_linux.sh, install.py) which auto-detect GPUs and CUDA versions. Manual pip installation is also supported.
  • Prerequisites: Python 3.9-3.12 (3.13+ incompatible), FFmpeg in system PATH, GPU recommended (NVIDIA CUDA, Apple MPS, AMD ROCm), 8GB+ disk space. Windows users require specific NVIDIA drivers, CUDA Toolkit, and cuDNN versions.
  • Resource Footprint: GPU processing is estimated at 5-10 minutes per hour of video (NVIDIA), significantly longer for CPU-only. Model downloads can exceed 3GB.
  • Links: Windows Installer, Source Installation Scripts, Python Downloads, FFmpeg Builds.
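The prerequisites above (Python 3.9-3.12, FFmpeg on PATH) can be checked up front before attempting an install. A minimal sketch; the function name is hypothetical and not part of WhisperJAV's install scripts, which perform their own detection.

```python
import shutil
import sys
from typing import List, Optional, Tuple

def environment_problems(py_version: Tuple[int, int],
                         ffmpeg_path: Optional[str]) -> List[str]:
    """Collect human-readable problems with the local environment."""
    problems = []
    major, minor = py_version
    if not (major == 3 and 9 <= minor <= 12):
        problems.append(f"Python {major}.{minor} is unsupported; need 3.9-3.12")
    if ffmpeg_path is None:
        problems.append("FFmpeg was not found on PATH")
    return problems

# In practice you would call:
#   environment_problems(sys.version_info[:2], shutil.which("ffmpeg"))
```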

Highlighted Details

  • Processing Modes: Offers faster, fast, balanced (default), fidelity, and transformers (utilizing a Japanese-optimized model).
  • Sensitivity Settings: conservative, balanced, aggressive to control hallucination thresholds.
  • Two-Pass Ensemble: Combines results from two different pipelines (e.g., transformers + balanced) for potentially enhanced accuracy.
  • AI Translation: Integrated subtitle translation capabilities supporting multiple providers (DeepSeek, Gemini, Claude, GPT-4, OpenRouter) with resume functionality.
  • Scene Detection: Supports auditok (default), silero, and semantic methods.
  • Japanese Post-Processing: Handles specific linguistic features like particles, aizuchi, dialects, and filters common Whisper hallucinations.
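Conceptually, the sensitivity settings trade recall against hallucination risk by shifting decoder thresholds. The mapping below is purely illustrative: the preset names come from the README, but the numeric threshold values are assumptions, not WhisperJAV's actual configuration.

```python
# Hypothetical preset-to-threshold mapping. "conservative" rejects more
# low-confidence output (fewer hallucinations, more dropped speech);
# "aggressive" keeps more of the decode.
SENSITIVITY_PRESETS = {
    "conservative": {"logprob_threshold": -0.5, "no_speech_threshold": 0.4},
    "balanced":     {"logprob_threshold": -1.0, "no_speech_threshold": 0.6},
    "aggressive":   {"logprob_threshold": -1.5, "no_speech_threshold": 0.8},
}

def thresholds_for(preset: str) -> dict:
    """Look up decoder thresholds for a sensitivity preset."""
    try:
        return SENSITIVITY_PRESETS[preset]
    except KeyError:
        raise ValueError(
            f"unknown preset {preset!r}; choose one of {sorted(SENSITIVITY_PRESETS)}"
        )
```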

Maintenance & Community

No specific details regarding maintainers, community channels (e.g., Discord/Slack), or sponsorship were found in the provided README.

Licensing & Compatibility

  • License: MIT License.
  • Compatibility: The MIT license generally permits commercial use and integration into closed-source projects. Users are responsible for legal compliance regarding content processing.

Limitations & Caveats

Python versions 3.13 and above are incompatible. AMD GPU (ROCm) support is experimental, and CPU-only processing is notably slow. The tool generates subtitles for accessibility, and users bear responsibility for adhering to relevant laws.

Health Check

  • Last Commit: 1 day ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 44
  • Star History: 710 stars in the last 30 days
