cbp-translate by elanmart

Cyberpunk-style video translation with modern DL stack

Created 3 years ago

1,275 stars

Top 31.0% on SourcePulse

5 Experts Love This Project

Nat Friedman

Former CEO of GitHub

Georgi Gerganov

Author of llama.cpp, whisper.cpp

Jeff Hammerbacher

Cofounder of Cloudera

Akshat Bubna

Cofounder of Modal

and 1 more!

Project Summary

This project is a proof-of-concept for automated video translation that mimics the subtitle style of Cyberpunk 2077. It is not real-time: processing 30s of video takes several minutes. It targets users interested in automated dubbing and subtitling for video content, offering a pipeline that detects speakers, transcribes speech, translates it, and overlays subtitles onto the original video.

How It Works

The system integrates multiple pre-trained ML models to achieve its functionality. It uses ffmpeg-python for video and audio processing, Whisper for speech-to-text, NVIDIA NeMo for speaker diarization, DeepL for translation, and RetinaFace with DeepFace for face detection and embedding. Speaker and face IDs are matched using heuristics, and subtitles are generated and overlaid using PIL and OpenCV. The architecture is designed for serverless deployment using Modal and features a Gradio frontend for user interaction.
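
The summary does not say which heuristics are used to match speaker and face IDs. As an illustration only, one simple possibility is a co-occurrence vote: each speaker is assigned the face that appears on screen most often while that speaker is talking. The sketch below assumes this approach; `SpeechSegment`, `FaceSighting`, and `match_speakers_to_faces` are hypothetical names, not taken from the project.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SpeechSegment:
    """Hypothetical output of the diarization step."""
    speaker_id: str
    start: float  # seconds
    end: float    # seconds


@dataclass(frozen=True)
class FaceSighting:
    """Hypothetical output of the face detection/embedding step."""
    face_id: str
    time: float   # timestamp of the video frame, in seconds


def match_speakers_to_faces(segments, sightings):
    """Assign each speaker the face seen most often while they talk.

    This is an assumed heuristic for illustration, not the project's own code.
    Speakers with no visible face during their segments are left unmatched.
    """
    votes = defaultdict(Counter)
    for seg in segments:
        for sighting in sightings:
            if seg.start <= sighting.time <= seg.end:
                votes[seg.speaker_id][sighting.face_id] += 1
    return {spk: counts.most_common(1)[0][0] for spk, counts in votes.items()}


segments = [
    SpeechSegment("A", 0.0, 2.0),
    SpeechSegment("B", 2.5, 5.0),
]
sightings = [
    FaceSighting("face_1", 0.5),
    FaceSighting("face_1", 1.0),
    FaceSighting("face_2", 1.0),
    FaceSighting("face_1", 1.5),
    FaceSighting("face_1", 3.0),
    FaceSighting("face_2", 3.0),
    FaceSighting("face_2", 3.5),
    FaceSighting("face_2", 4.0),
]
mapping = match_speakers_to_faces(segments, sightings)
print(mapping)
```

A vote like this can fail when the camera shows a listener rather than the speaker, which is one way a basic matching heuristic breaks down.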

Quick Start & Requirements

Modal Deployment: Requires a Modal account, HuggingFace token, and DeepL API key. Install with pip install -r requirements-modal.txt and run python cbp_translate/app.py.
Local Development: Requires ffmpeg, libsndfile1, git, build-essential, CUDA/cuDNN. Install dependencies via requirements-local.txt and run CLI commands.
Dependencies: Python 3.x, ffmpeg, git-lfs for large files.
Resources: Processing 30s of video takes several minutes on a modern PC.
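
Before a local install, it can help to confirm that the command-line tools listed above are on the PATH. The following is a minimal sketch using only the Python standard library; the `missing_tools` helper is hypothetical and not part of the project. It checks only executables, since libsndfile1, build-essential, and CUDA/cuDNN are libraries or package groups rather than single commands.

```python
import shutil

# Executables named in the requirements above. Libraries such as libsndfile1
# and CUDA/cuDNN cannot be detected this way and are deliberately omitted.
REQUIRED_TOOLS = ("ffmpeg", "git", "git-lfs")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on the PATH.

    `which` is injectable so the check can be exercised without touching
    the real environment.
    """
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required command-line tools were found.")
```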

Highlighted Details

Leverages off-the-shelf, pre-trained models, avoiding the need for gradient updates or custom data labeling.
Demonstrates integration of speech recognition, speaker diarization, translation, and facial recognition.
Utilizes Modal for serverless cloud deployment, enabling remote execution with minimal boilerplate.
Offers a Gradio frontend for an interactive demo experience.

Maintenance & Community

The project is maintained by elanmart. Links to community resources like Discord/Slack are not provided in the README.

Licensing & Compatibility

The project's licensing is not explicitly stated in the README. Compatibility for commercial use or closed-source linking is not detailed.

Limitations & Caveats

This is a proof-of-concept with significant limitations. Processing is slow: 30s of video takes several minutes. The pipeline struggles with videos that contain multiple scenes, the speaker/face matching heuristics are basic and can fail, and the underlying tools are themselves imperfect. It has only been tested on a limited set of examples, and font handling for non-Latin characters is not robust.
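
To illustrate the font limitation, a subtitle renderer could check whether translated text contains letters outside the Latin script and, if so, switch to a font that covers them. This is a sketch of one possible guard, not something the project is known to do; `needs_non_latin_font` is a hypothetical helper built on the standard `unicodedata` module.

```python
import unicodedata


def needs_non_latin_font(text: str) -> bool:
    """Return True if any letter in `text` lies outside the Latin script.

    Relies on Unicode character names, which begin with "LATIN" for
    Latin-script letters (including accented ones such as "é").
    Digits, punctuation, and whitespace are ignored.
    """
    for ch in text:
        if ch.isalpha() and not unicodedata.name(ch, "").startswith("LATIN"):
            return True
    return False


for sample in ("Hello, world!", "café", "Привет", "こんにちは"):
    print(repr(sample), needs_non_latin_font(sample))
```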

Health Check

Last Commit

3 years ago

Responsiveness

Inactive

Star History

2 stars in the last 30 days