raylight  by komikndr

Multi-GPU parallelism for ComfyUI

Created 7 months ago
296 stars

Top 89.7% on SourcePulse

Project Summary

Raylight addresses the need for true multi-GPU parallelism in ComfyUI, enabling users to scale image and video generation beyond a single high-end GPU. By combining Ray for distributed computing, xDiT's Unified Sequence Parallelism (USP) for splitting tensor computation, and PyTorch's Fully Sharded Data Parallel (FSDP) for sharding model weights, it pools VRAM across multiple GPUs. This allows more efficient resource utilization and makes it possible to run larger models or generate higher-resolution content.

How It Works

Raylight integrates as a ComfyUI node, orchestrating GPU workers via the Ray framework. It uses USP to split tensor computation across GPUs, so each one processes part of a sequence. Crucially, FSDP shards the model weights themselves across the available GPUs, removing the constraint that each GPU must load the entire model. Together, USP and FSDP maximize GPU utilization, effectively pooling VRAM and enabling scalable inference.
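The split-process-gather flow described above can be sketched in a few lines of plain Python. This is a conceptual toy only (the names split_sequence, worker_step, and parallel_forward are illustrative, not Raylight's API): each worker handles one contiguous shard of the sequence and the outputs are gathered back in order, whereas the real system runs the shards concurrently on GPUs via Ray actors, with xDiT's USP attention kernels and FSDP-sharded weights.

```python
# Toy illustration of sequence parallelism: the sequence is split into
# contiguous chunks, each "worker" (standing in for a GPU) transforms its
# chunk independently, and the results are gathered back in order.
# Hypothetical helper names; not Raylight's actual implementation.

def split_sequence(seq, num_workers):
    """Split seq into num_workers contiguous chunks (last may be shorter)."""
    chunk = -(-len(seq) // num_workers)  # ceiling division
    return [seq[i * chunk:(i + 1) * chunk] for i in range(num_workers)]

def worker_step(chunk):
    """Stand-in for per-GPU computation on one sequence shard."""
    return [x * 2 for x in chunk]

def parallel_forward(seq, num_workers):
    shards = split_sequence(seq, num_workers)
    outputs = [worker_step(s) for s in shards]  # would run concurrently on GPUs
    return [x for out in outputs for x in out]  # "all-gather" back into one sequence

print(parallel_forward(list(range(8)), 4))  # → [0, 2, 4, 6, 8, 10, 12, 14]
```

FSDP is orthogonal to this: instead of splitting the sequence, it splits the model parameters across the same workers, so no single GPU ever holds the full set of weights.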

Quick Start & Requirements

Installation is straightforward: clone the repository into ComfyUI/custom_nodes and run pip install -r requirements.txt, or use the ComfyUI Manager. Key dependencies include nvidia-nccl-cu12==2.28.9 for FSDP with fp8 models on NVIDIA GPUs, plus optional FlashAttention 2. Windows users are advised to use WSL due to installation complexities.

Highlighted Details

  • Extensive model support including Wan, Flux, Chroma, Qwen, Z Image, Lumina 2, Hunyuan Video, Kandinsky5, LTX-2, SD1.5, and SDXL.
  • Multiple operational modes: Sequence Parallelism (USP), Data Parallel (DP), FSDP, and combined Sequence + FSDP.
  • FSDP CPU Offload is available for extremely low VRAM scenarios.
  • Supports AMD ROCm architectures (MI3XX, MI210).
  • Benchmarks demonstrate significant speedups and VRAM efficiency gains with multi-GPU configurations.

Maintenance & Community

The project acknowledges contributions from specific users (rmatif, City96) and provides a PayPal link for support. No community channels (Discord, Slack) or detailed roadmap are mentioned in the provided text.

Licensing & Compatibility

No license information is explicitly stated in the provided README text, which is a critical omission for assessing commercial use or derivative works.

Limitations & Caveats

Known issues include potential dequantization errors with ComfyUI's mixed-precision fp8 models and VRAM leaks with specific Ring/Ulysses configurations. PyTorch 2.8.1+ is recommended for FSDP stability. Some models have limited or no USP/FSDP support. Windows installation is complex, with WSL being the recommended workaround.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 11
  • Star History: 30 stars in the last 30 days

Explore Similar Projects

Starred by Amanpreet Singh (Cofounder of Contextual AI) and Ross Taylor (Cofounder of General Reasoning; Cocreator of Papers with Code).

torchshard by kaiyuyue
300 stars
PyTorch engine for tensor slicing into parallel shards
Created 4 years ago · Updated 8 months ago
Starred by Tri Dao (Chief Scientist at Together AI), Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), and 1 more.

oslo by tunib-ai
309 stars
Framework for large-scale transformer optimization
Created 4 years ago · Updated 3 years ago
Starred by Luca Soldaini (Research Scientist at Ai2), Edward Sun (Research Scientist at Meta Superintelligence Lab), and 4 more.

parallelformers by tunib-ai
791 stars
Toolkit for easy model parallelization
Created 4 years ago · Updated 2 years ago
Starred by Alex Yu (Research Scientist at OpenAI; Cofounder of Luma AI) and Cody Yu (Coauthor of vLLM; MTS at OpenAI).

xDiT by xdit-project
3k stars
Inference engine for parallel Diffusion Transformer (DiT) deployment
Created 1 year ago · Updated 4 days ago