LTX-Video is a DiT-based video generation model designed for real-time, high-quality video creation. It targets researchers and developers interested in advanced video synthesis, offering capabilities like text-to-video, image-to-video, and video extension, with a focus on speed and resolution.
How It Works
LTX-Video uses a Diffusion Transformer (DiT) architecture to generate high-resolution video at 30 FPS in real time. In other words, a clip can be generated in less time than it takes to watch it, which the project presents as a significant improvement over previous methods. The model is trained on a large, diverse video dataset, which helps it produce realistic and varied content.
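To make "faster than watch time" concrete, the sketch below compares a clip's playback duration with the time taken to generate it. The function names and the timing figure are hypothetical illustrations, not measurements of LTX-Video.

```python
def video_duration_seconds(num_frames: int, fps: float = 30.0) -> float:
    """Playback length of a clip with num_frames frames at the given frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return num_frames / fps


def real_time_factor(num_frames: int, generation_seconds: float, fps: float = 30.0) -> float:
    """Ratio of playback time to generation time.

    A value above 1.0 means the clip was generated faster than it plays back.
    """
    if generation_seconds <= 0:
        raise ValueError("generation_seconds must be positive")
    return video_duration_seconds(num_frames, fps) / generation_seconds


# Hypothetical example: a 121-frame clip generated in 2.0 seconds.
factor = real_time_factor(num_frames=121, generation_seconds=2.0)
print(f"real-time factor: {factor:.2f}")
```

Measuring `generation_seconds` on your own hardware is the only reliable way to know whether a given resolution and frame count runs in real time for you.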
Quick Start & Requirements
- Installation: Clone the repository, create a virtual environment, and install with `pip install -e .[inference-script]`.
- Dependencies: Python 3.10.5+, CUDA 12.2+, PyTorch >= 2.1.2. MPS support for macOS requires PyTorch 2.3.0 or >= 2.6.
- Model Download: Use `hf_hub_download` from the Hugging Face Hub client library to get the distilled or full model checkpoints.
- Inference: Run the `inference.py` script for text-to-video, image-to-video, and video extension.
- ComfyUI/Diffusers: Integrations available via separate repositories and official documentation.
- Resources: Requires significant GPU resources for local inference.
- Links: Website, Model, Demo, Paper.
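The model download step might look like the sketch below. It assumes the `huggingface_hub` package is installed and that the checkpoints live in the `Lightricks/LTX-Video` repository on the Hub. The checkpoint filename is deliberately left as a parameter, since actual filenames vary by release and should be taken from the model page. The `downloader` argument is a hypothetical hook added here only to make the helper easy to test offline.

```python
from pathlib import Path

# Assumption: the Hub repository that hosts the LTX-Video checkpoints.
REPO_ID = "Lightricks/LTX-Video"


def download_checkpoint(filename, local_dir="checkpoints", repo_id=REPO_ID, downloader=None):
    """Download one checkpoint file and return its local path.

    filename: name of the checkpoint file on the Hub (look it up on the model page).
    downloader: optional callable with the same keyword arguments as
        hf_hub_download; defaults to the real function.
    """
    if downloader is None:
        # Imported lazily so the module loads even without huggingface_hub installed.
        from huggingface_hub import hf_hub_download
        downloader = hf_hub_download

    Path(local_dir).mkdir(parents=True, exist_ok=True)
    local_path = downloader(repo_id=repo_id, filename=filename, local_dir=str(local_dir))
    return Path(local_path)


# Usage (not executed here because it needs network access):
# ckpt = download_checkpoint("<checkpoint-file>.safetensors")  # placeholder name
```

The returned path can then be passed to whatever configuration or argument the inference script expects for the checkpoint location.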
Highlighted Details
- Generates 30 FPS videos at 1216x704 resolution in real time.
- Supports text-to-video, image-to-video, keyframe animation, video extension (forward/backward), and video-to-video transformations.
- The distilled model offers 15x faster inference, runs with fewer diffusion steps, and does not use classifier-free guidance.
- Features automatic prompt enhancement for shorter prompts.
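To illustrate why dropping classifier-free guidance (CFG) and reducing the step count both cut inference cost, here is a minimal sketch of the standard CFG combination rule and a count of model evaluations. This is the generic technique, not LTX-Video's own code, and the step counts are hypothetical. It does not attempt to reproduce the 15x figure.

```python
def cfg_combine(uncond, cond, guidance_scale):
    """Standard classifier-free guidance: move the unconditional prediction
    toward the conditional one, scaled by guidance_scale."""
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]


def model_evaluations(num_steps: int, use_cfg: bool) -> int:
    """Number of denoiser evaluations for one sampling run.

    With CFG, each step needs both a conditional and an unconditional
    prediction; without it, one prediction per step is enough.
    """
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    return num_steps * (2 if use_cfg else 1)


# Hypothetical step counts, chosen only for illustration.
full = model_evaluations(num_steps=40, use_cfg=True)
distilled = model_evaluations(num_steps=8, use_cfg=False)
print(f"full: {full} evaluations, distilled: {distilled} evaluations")
```

In practice the two CFG predictions are often computed as one batched call, so the wall-clock saving from omitting guidance depends on batch size and hardware.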
Maintenance & Community
- Active development with regular updates and new checkpoints.
- Community contributions are encouraged, with projects like ComfyUI-LTXTricks and LTX-VideoQ8 highlighted.
- Links to community discussions and careers page available.
Licensing & Compatibility
- Newer checkpoints (v0.9.6, v0.9.5) are released under an "Open Weights" or "OpenRail-M" license, allowing commercial use. Earlier versions may have different terms.
Limitations & Caveats
- Input video segments for video extension must contain a multiple of 8 plus 1 frames (for example 9, 17, or 25).
- The model works best at resolutions under 720x1280 and frame counts below 257.
- Although generation can run in real time, actual speed depends heavily on hardware, especially at higher resolutions and frame counts.
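The frame-count and size constraints above can be checked before running inference. The helpers below are a small sketch. Their names are hypothetical, and reading "under 720x1280" as a total pixel budget is an assumption made here, not something the project states.

```python
FRAME_STRIDE = 8                      # frame counts must be a multiple of 8, plus 1
MAX_RECOMMENDED_FRAMES = 257          # recommended: frame counts below this value
MAX_RECOMMENDED_PIXELS = 720 * 1280   # assumption: limit interpreted as pixel count


def is_valid_frame_count(num_frames: int) -> bool:
    """True if num_frames has the form 8 * k + 1 for some integer k >= 0."""
    return num_frames >= 1 and (num_frames - 1) % FRAME_STRIDE == 0


def nearest_valid_frame_count(num_frames: int) -> int:
    """Closest frame count of the form 8 * k + 1 (ties round up)."""
    k = max(0, (num_frames - 1 + FRAME_STRIDE // 2) // FRAME_STRIDE)
    return FRAME_STRIDE * k + 1


def within_recommended_limits(width: int, height: int, num_frames: int) -> bool:
    """True if the request stays inside the recommended size and length."""
    return (
        width * height <= MAX_RECOMMENDED_PIXELS
        and num_frames < MAX_RECOMMENDED_FRAMES
    )


print(is_valid_frame_count(17))                   # True: 17 = 8 * 2 + 1
print(nearest_valid_frame_count(100))             # 97 = 8 * 12 + 1
print(within_recommended_limits(1216, 704, 121))  # True under the assumptions above
```

Snapping a requested length with `nearest_valid_frame_count` before trimming an input clip avoids a failed run caused by an off-by-a-few frame count.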