TransPixeler by wileewang

Text-to-video generation research paper focusing on transparency

created 7 months ago
880 stars

Top 41.8% on sourcepulse

Project Summary

TransPixeler enables text-to-video generation of RGBA content (RGB plus an alpha channel for transparency), a capability crucial for visual effects and seamless scene compositing. It targets researchers and developers in computer vision and generative AI who want to extend existing video models to transparency applications. The primary benefit is the ability to generate videos with controllable transparency, which broadens both realism and creative possibilities.

How It Works

TransPixeler adapts pre-trained diffusion transformer (DiT) video models for RGBA generation. It incorporates alpha-specific tokens and utilizes LoRA-based fine-tuning to jointly generate RGB and alpha channels. This approach optimizes attention mechanisms to maintain the original model's RGB quality while ensuring high consistency between the RGB and alpha outputs, even with limited training data.
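The attention design described above can be sketched as a mask over the joint token sequence. This is a minimal illustration, not the repository's implementation: the token counts are toy values, and the specific rule shown (blocking text queries from attending to the appended alpha tokens, while RGB and alpha tokens attend to each other freely) is an assumption about how "preserve RGB quality, keep RGB/alpha consistent" could be encoded.

```python
import numpy as np

# Illustrative token counts; real DiT video models use thousands of tokens.
n_text, n_rgb, n_alpha = 4, 6, 6
n_total = n_text + n_rgb + n_alpha

# Start from full self-attention over the joint [text | RGB | alpha] sequence.
attn_mask = np.ones((n_total, n_total), dtype=bool)

# Assumed rule: block text queries from attending to the appended alpha
# tokens, so the text-conditioned RGB pathway behaves as in the
# pre-trained model.
alpha_start = n_text + n_rgb
attn_mask[:n_text, alpha_start:] = False

# RGB and alpha tokens still attend to each other in both directions,
# which is what ties the generated alpha channel to the RGB content.
assert attn_mask[n_text, alpha_start]      # RGB -> alpha: allowed
assert not attn_mask[0, alpha_start]       # text -> alpha: blocked
```

A mask like this would be applied inside each attention layer, while LoRA adapters on the projection weights supply the alpha-specific capacity without touching the frozen RGB weights.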

Quick Start & Requirements

  • Install via pip install -r requirements.txt within a conda environment (Python 3.10 recommended).
  • Requires LoRA weights for inference.
  • Local inference demo available via python app.py.
  • CLI inference: python cli.py --lora_path /path/to/lora --prompt "..."
  • For joint generation with Wan2.1, check out the wan branch and ensure data follows the 001.mp4, 001_seg.mp4, 001.txt structure.
  • Official Hugging Face demo available.
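Put together, a typical setup-and-inference session based on the steps above might look like the following. The environment name is a placeholder, and the LoRA path and prompt are left as the placeholders the README uses:

```shell
# Create an isolated environment (Python 3.10 recommended) and install deps.
conda create -n transpixeler python=3.10 -y
conda activate transpixeler
pip install -r requirements.txt

# Launch the local inference demo.
python app.py

# Or run CLI inference with downloaded LoRA weights (placeholder values).
python cli.py --lora_path /path/to/lora --prompt "..."
```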

Highlighted Details

  • CVPR 2025 accepted paper.
  • Supports Text-to-RGBA Video and Image-to-RGBA Video.
  • LoRA weights provided for Text-to-Video + RGBA using THUDM/CogVideoX-5B (49 frames, ~24GB VRAM).
  • New wan branch supports joint generation of RGB and associated modalities (e.g., segmentation maps, alpha masks) with Wan2.1.

Maintenance & Community

  • Active development with recent updates including a new wan branch for joint generation and roadmap additions for Hunyuan, LTX, and ComfyUI integration.
  • Discord and WeChat groups available for discussion and collaboration.
  • Project page and arXiv paper available.

Licensing & Compatibility

  • License details are not explicitly stated in the README.
  • Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project is research code accompanying a CVPR 2025 paper and should be treated as experimental. Hardware requirements for training, or for inference scenarios beyond the stated ~24GB VRAM for the provided LoRA weights, are not documented. License information, including terms for commercial use, is absent.

Health Check
Last commit

2 months ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
1
Star History
36 stars in the last 90 days
