cbp-translate  by elanmart

Cyberpunk-style video translation with modern DL stack

created 2 years ago
1,277 stars


Project Summary

This project is a proof-of-concept for automated video translation that mimics the subtitle style of Cyberpunk 2077. It targets users interested in automated subtitling of video content, offering a pipeline that detects speakers, transcribes speech, translates it, and overlays subtitles onto the original video. It is not real-time in practice: processing 30s of video takes several minutes.

How It Works

The system chains several pre-trained ML models. It uses ffmpeg-python for video and audio processing, Whisper for speech-to-text, NVIDIA NeMo for speaker diarization, DeepL for translation, and RetinaFace with DeepFace for face detection and embedding. Speaker IDs and face IDs are matched using heuristics, and subtitles are rendered and overlaid on the frames using PIL and OpenCV. The architecture is designed for serverless deployment on Modal and includes a Gradio frontend for user interaction.
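The summary does not say which heuristics match speakers to faces, so the following is only a minimal sketch of one plausible approach: count how often each face is visible while each speaker is talking, then assign the strongest pairs greedily. The names `SpeechSegment`, `FaceSighting`, and `match_speakers_to_faces` are hypothetical and do not come from the project.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SpeechSegment:
    """One diarized stretch of speech (hypothetical structure)."""
    speaker_id: str
    start: float  # seconds
    end: float    # seconds


@dataclass(frozen=True)
class FaceSighting:
    """One face detection at a given timestamp (hypothetical structure)."""
    face_id: str
    time: float  # seconds


def match_speakers_to_faces(segments, sightings):
    """Map each speaker to the face that co-occurs most with their speech.

    Assumption: a face seen while a speaker talks is weak evidence that
    the face belongs to that speaker. Each face is assigned at most once.
    """
    votes = defaultdict(Counter)
    for seg in segments:
        for sighting in sightings:
            if seg.start <= sighting.time <= seg.end:
                votes[seg.speaker_id][sighting.face_id] += 1

    # Strongest evidence first; ties are broken by ID for determinism.
    candidates = sorted(
        (
            (count, speaker, face)
            for speaker, counter in votes.items()
            for face, count in counter.items()
        ),
        key=lambda item: (-item[0], item[1], item[2]),
    )

    mapping = {}
    used_faces = set()
    for _count, speaker, face in candidates:
        if speaker in mapping or face in used_faces:
            continue
        mapping[speaker] = face
        used_faces.add(face)
    return mapping


segments = [
    SpeechSegment("spk_A", 0.0, 2.0),
    SpeechSegment("spk_B", 2.5, 4.0),
]
sightings = [
    FaceSighting("face_1", 0.5),
    FaceSighting("face_1", 1.0),
    FaceSighting("face_1", 1.5),
    FaceSighting("face_2", 1.0),
    FaceSighting("face_2", 3.0),
    FaceSighting("face_2", 3.5),
    FaceSighting("face_1", 3.0),
]
result = match_speakers_to_faces(segments, sightings)
```

A speaker who never overlaps with any sighting is simply left out of the mapping. That is one way such a heuristic can fail, for example when the speaker is off-screen.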

Quick Start & Requirements

  • Modal Deployment: Requires a Modal account, HuggingFace token, and DeepL API key. Install with pip install -r requirements-modal.txt and run python cbp_translate/app.py.
  • Local Development: Requires ffmpeg, libsndfile1, git, build-essential, CUDA/cuDNN. Install dependencies via requirements-local.txt and run CLI commands.
  • Dependencies: Python 3.x, ffmpeg, git-lfs for large files.
  • Resources: Processing 30s of video takes several minutes on a modern PC.

Highlighted Details

  • Leverages off-the-shelf, pre-trained models, avoiding the need for gradient updates or custom data labeling.
  • Demonstrates integration of speech recognition, speaker diarization, translation, and facial recognition.
  • Utilizes Modal for serverless cloud deployment, enabling remote execution with minimal boilerplate.
  • Offers a Gradio frontend for an interactive demo experience.

Maintenance & Community

The project is maintained by elanmart. Links to community resources like Discord/Slack are not provided in the README.

Licensing & Compatibility

The project's licensing is not explicitly stated in the README. Compatibility for commercial use or closed-source linking is not detailed.

Limitations & Caveats

This is a proof-of-concept with significant limitations: processing is slow (several minutes for 30s of video), it struggles with videos that contain multiple scenes, the speaker/face matching heuristics are basic and can fail, and the pipeline depends on imperfect upstream tools. It has been tested on only a limited set of examples, and font handling for non-Latin characters is not robust.
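One way to make the font limitation more tractable would be to check each translated line for non-Latin letters before rendering it and switch to a font with broader coverage when any are found. The sketch below illustrates that idea only. The project is not known to work this way, and `needs_fallback_font`, `pick_font`, and the font names are hypothetical.

```python
import unicodedata


def needs_fallback_font(text: str) -> bool:
    """Return True if any letter in `text` is outside the Latin script.

    Assumption: Unicode character names starting with "LATIN" are a
    good-enough proxy for "covered by a Latin-only subtitle font".
    """
    for ch in text:
        if ch.isalpha() and not unicodedata.name(ch, "").startswith("LATIN"):
            return True
    return False


def pick_font(text: str, default_font: str, fallback_font: str) -> str:
    """Choose a font name for one subtitle line (names are placeholders)."""
    return fallback_font if needs_fallback_font(text) else default_font


latin_choice = pick_font("Wake up, samurai!", "latin-only-font", "wide-coverage-font")
cyrillic_choice = pick_font("Привет", "latin-only-font", "wide-coverage-font")
```

Digits, punctuation, and whitespace are ignored by the check, so a line such as "2077!" stays on the default font.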

Health Check

  • Last commit: 2 years ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 5 stars in the last 90 days
