ID-LoRA by ID-LoRA

Generate personalized talking videos with custom voice and appearance

Created 1 month ago
256 stars

Top 98.5% on SourcePulse

Project Summary

Summary

ID-LoRA enables identity-preserving audio-video generation: it synthesizes high-resolution talking videos with a custom voice and appearance. It targets researchers and engineers and offers a unified, zero-shot approach. The latest release integrates LTX-2.3, which the project reports as a quality improvement.

How It Works

ID-LoRA is built on the LTX-2 and LTX-2.3 joint audio-video diffusion models (19B and 22B parameters, respectively) and uses In-Context LoRA for identity transfer. Reference audio is encoded and prepended to the target latents, and the reference tokens are assigned negative temporal positions so the model can distinguish them from the target. Identity guidance then amplifies speaker features. Unlike cascaded pipelines, this unified approach lets a single prompt control both visuals and audio while preserving identity. Training is lightweight, requiring roughly 3K pairs on a single GPU.
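The positional scheme described above can be illustrated with a minimal sketch. It is not the project's code: the function names prepend_reference and identity_guidance are hypothetical, string placeholders stand in for real latent tensors, and the guidance formula is an assumed classifier-free-guidance-style extrapolation, since the summary does not state the exact form ID-LoRA uses.

```python
from typing import List, Sequence, Tuple


def prepend_reference(
    ref_tokens: Sequence[str],
    target_tokens: Sequence[str],
) -> Tuple[List[str], List[int]]:
    """Concatenate reference and target tokens and assign temporal positions.

    Reference tokens get negative positions (-len(ref) .. -1), while target
    tokens keep the usual non-negative positions (0 .. len(target) - 1).
    Strings are placeholders for latent vectors (illustrative only).
    """
    sequence = list(ref_tokens) + list(target_tokens)
    positions = list(range(-len(ref_tokens), 0)) + list(range(len(target_tokens)))
    return sequence, positions


def identity_guidance(
    with_ref: Sequence[float],
    without_ref: Sequence[float],
    scale: float,
) -> List[float]:
    """ASSUMPTION: a CFG-style extrapolation, not the project's stated formula.

    Moves the prediction away from the reference-free output and toward the
    reference-conditioned one; scale=1.0 returns the conditioned prediction.
    """
    if len(with_ref) != len(without_ref):
        raise ValueError("predictions must have the same length")
    return [wo + scale * (w - wo) for w, wo in zip(with_ref, without_ref)]


if __name__ == "__main__":
    seq, pos = prepend_reference(["r0", "r1", "r2"], ["t0", "t1"])
    print(seq)  # ['r0', 'r1', 'r2', 't0', 't1']
    print(pos)  # [-3, -2, -1, 0, 1]
```

Because every reference position is negative and every target position is non-negative, the sign of the position alone separates the two segments, which is the property the description relies on.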

Quick Start & Requirements

To install, clone the repository and run uv sync --frozen. Prerequisites are Python 3.11+, CUDA 12.x, and 24GB of VRAM (48GB recommended for two-stage inference). The base LTX-2 models are downloaded with scripts/download_models.sh. Using LTX-2.3 requires specific package management edits followed by a re-sync. Native ComfyUI integration is available in upstream ComfyUI (PR #13111), which simplifies node-based workflows. Pre-trained checkpoints are available on HuggingFace.
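The prerequisites listed above could be checked before installation with a small script like the following. This is a sketch, not part of the project: check_prerequisites and its constants are hypothetical names, and the CUDA and VRAM values are passed in by hand rather than detected.

```python
import sys
from typing import List, Optional, Tuple

# Thresholds taken from the prerequisites stated in this summary.
MIN_PYTHON = (3, 11)
REQUIRED_CUDA_MAJOR = 12
MIN_VRAM_GB = 24
TWO_STAGE_VRAM_GB = 48  # recommended, not required


def check_prerequisites(
    python_version: Tuple[int, int],
    cuda_major: Optional[int],
    vram_gb: float,
    two_stage: bool = False,
) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems: List[str] = []
    if tuple(python_version) < MIN_PYTHON:
        problems.append("Python 3.11 or newer is required")
    if cuda_major != REQUIRED_CUDA_MAJOR:
        problems.append("CUDA 12.x is required")
    if vram_gb < MIN_VRAM_GB:
        problems.append("at least 24GB of VRAM is required")
    elif two_stage and vram_gb < TWO_STAGE_VRAM_GB:
        problems.append("48GB of VRAM is recommended for two-stage inference")
    return problems


if __name__ == "__main__":
    # Python version comes from the running interpreter; the CUDA major
    # version and VRAM size below are hypothetical example values.
    issues = check_prerequisites(sys.version_info[:2], 12, 24, two_stage=True)
    for issue in issues:
        print(issue)
```

Returning a list of messages instead of raising on the first failure lets a user see every unmet requirement in one pass.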

Highlighted Details

ID-LoRA 2.3 supports LTX-2.3 (22B parameters), which brings improved text conditioning, enhanced audio, and a new Two-Stage HQ inference mode for higher fidelity. According to the project's benchmarks and human evaluations, ID-LoRA outperforms baselines such as Kling 2.6 Pro and ElevenLabs + WAN2.2 in speaker similarity, lip sync, and overall preference, a result it attributes to joint audio-video generation.

Maintenance & Community

ComfyUI integration is active, through both a custom node and upstream support. Roadmap items for evaluation datasets and scripts are pending.

Licensing & Compatibility

The license terms are given in the repository's LICENSE file, and the project is intended for research purposes only. Given the significant potential for misuse, commercial applications should not proceed without careful review and consent.

Limitations & Caveats

High VRAM requirements (24GB minimum, 48GB recommended for two-stage inference) are a barrier. Substantial ethical risks (impersonation, fraud, misinformation) call for strict adherence to responsible use guidelines and applicable laws. Switching between LTX-2 and LTX-2.3 requires careful package management.

Health Check

Last Commit: 2 weeks ago
Responsiveness: Inactive
Pull Requests (30d): 2
Issues (30d): 6
Star History: 251 stars in the last 30 days
