Cyberpunk-style video translation with modern DL stack
This project is a proof-of-concept for automated video translation that mimics the subtitle style of Cyberpunk 2077. It targets users interested in automatically translating and subtitling video content, offering a pipeline that detects speakers, transcribes speech, translates it, and overlays subtitles onto the original video.
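As a rough mental model, the pipeline described above can be pictured as plain functions over timed segments: transcribe, translate each segment, then pick the active subtitle for every video frame. This is a minimal sketch, not the project's code. The `Segment` class, `run_pipeline`, `subtitle_for_frame`, and the injected `transcribe`/`translate` callables are hypothetical stand-ins for the speech-to-text, diarization, and translation stages.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Segment:
    start: float            # segment start, in seconds
    end: float              # segment end, in seconds
    speaker: str            # diarization label, e.g. "speaker_0" (hypothetical format)
    text: str               # source-language transcript
    translation: str = ""   # filled in by the translation stage


def run_pipeline(
    audio_path: str,
    transcribe: Callable[[str], List[Segment]],  # hypothetical stand-in for Whisper + NeMo diarization
    translate: Callable[[str], str],             # hypothetical stand-in for DeepL
) -> List[Segment]:
    """Transcribe the audio, then translate each segment in order."""
    segments = transcribe(audio_path)
    for seg in segments:
        seg.translation = translate(seg.text)
    return segments


def subtitle_for_frame(frame_idx: int, fps: float, segments: List[Segment]) -> Optional[Segment]:
    """Return the segment whose time span covers this frame, or None during silence."""
    t = frame_idx / fps  # frame index -> timestamp in seconds
    for seg in segments:
        if seg.start <= t < seg.end:
            return seg
    return None
```

Injecting the stages as callables keeps the sketch testable without GPUs or API keys; a real deployment would swap in the actual model calls.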
How It Works
The system integrates multiple pre-trained ML models. It uses ffmpeg-python for video and audio processing, Whisper for speech-to-text, NVIDIA NeMo for speaker diarization, DeepL for translation, and RetinaFace with DeepFace for face detection and embedding. Speaker and face IDs are matched using heuristics, and subtitles are generated and overlaid using PIL and OpenCV. The architecture is designed for serverless deployment using Modal and features a Gradio frontend for user interaction.
Quick Start & Requirements
Cloud deployment (Modal): requires a Modal account, a HuggingFace token, and a DeepL API key. Install with pip install -r requirements-modal.txt and run python cbp_translate/app.py.
Local setup: requires ffmpeg, libsndfile1, git, build-essential, and CUDA/cuDNN. Install dependencies via requirements-local.txt and run the CLI commands.
Additionally requires ffmpeg and git-lfs (for large files).
Highlighted Details
Uses Modal for serverless cloud deployment, enabling remote execution with minimal boilerplate.
Includes a Gradio frontend for an interactive demo experience.
Maintenance & Community
The project is maintained by elanmart. Links to community resources like Discord/Slack are not provided in the README.
Licensing & Compatibility
The project's licensing is not explicitly stated in the README. Compatibility for commercial use or closed-source linking is not detailed.
Limitations & Caveats
This is a proof-of-concept with significant limitations: processing is slow (minutes per 30s video), it struggles with multiple scenes, speaker/face matching heuristics are basic and can fail, and the pipeline relies on imperfect tools. It has only been tested on a limited set of examples. Font handling for non-Latin characters is not robust.