mammothmoda by bytedance

Unified multimodal AI for generation and editing

Created 8 months ago
273 stars

Top 94.5% on SourcePulse

Summary

MammothModa is a unified AR-Diffusion model family for multimodal understanding, generation, and editing. A single architecture handles text-to-image, text-to-video, image editing, and video editing, so practitioners do not need a separate model per task. Its sparse design is reported to cut inference cost relative to dense models of similar capacity.

How It Works

The project uses a fine-grained Mixture-of-Experts (MoE) architecture with 128 experts and Top-8 routing, so that roughly 3 billion of its 25 billion total parameters are active per forward pass. Combined with Diffusion Transformer (DiT) models and Qwen-VL foundation models, this design is reported to give 12x faster inference than dense models of similar capacity and to support all of the multimodal tasks within a single model.
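To make the routing idea concrete, here is a minimal sketch of Top-8 selection over 128 experts for a single token. It is not the project's implementation: the toy experts, the random router logits, and the function names `top_k_route` and `moe_forward` are all assumptions made for illustration, and renormalizing with a softmax over the selected logits is one common choice rather than something the summary confirms.

```python
import math
import random

NUM_EXPERTS = 128  # expert count quoted in the summary
TOP_K = 8          # routing width quoted in the summary


def top_k_route(router_logits, k=TOP_K):
    """Return (expert_indices, weights) for one token.

    Picks the k largest router logits and renormalizes them with a
    softmax, so the weights of the selected experts sum to 1.
    """
    # Indices of the k highest-scoring experts.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only (subtract the max for stability).
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return top, [e / total for e in exps]


def moe_forward(x, experts, router_logits, k=TOP_K):
    """Weighted sum of the selected experts' outputs for a scalar input x."""
    indices, weights = top_k_route(router_logits, k)
    return sum(w * experts[i](x) for i, w in zip(indices, weights))


# Hypothetical toy experts: each one just scales its input.
rng = random.Random(0)
experts = [(lambda x, s=rng.uniform(0.5, 1.5): s * x) for _ in range(NUM_EXPERTS)]
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

indices, weights = top_k_route(logits)
output = moe_forward(1.0, experts, logits)
```

Only the 8 selected experts are evaluated for the token; the other 120 are skipped, which is where the compute saving of a sparse MoE layer comes from.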

Quick Start & Requirements

Model weights and inference code for Mamoda2.5 (video generation/editing) and Mamoda2-Dev (image editing) are available on HuggingFace. Users may need to switch to specific branches (e.g., qwen25vl for preview versions). Explicit installation commands are not provided.

Project Page: https://mammothmoda.github.io/
Tech Reports: https://arxiv.org/abs/2511.18262, https://arxiv.org/abs/2605.02641

Highlighted Details

  • Fine-Grained MoE: Achieves ~12% parameter activation per forward pass, enabling 12x faster inference compared to dense models of similar capacity.
  • Unified Capabilities: A single model supports text-to-image, text-to-video, image editing, and video editing.
  • State-of-the-Art Performance: Reported to rank #1 on OpenVE-Bench and FiVE-Bench for video editing, with competitive video generation results on VBench 2.0.
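The activation figure above can be checked against the parameter counts quoted earlier. The short calculation below uses only the approximate numbers from this summary (3 billion active, 25 billion total). Note that the parameter ratio alone works out to about 8.3x fewer weights per forward pass, so the 12x speedup is the project's own reported figure and is not derived from this ratio.

```python
# Approximate figures quoted in the summary.
active_params = 3e9
total_params = 25e9

# Share of the weights used in one forward pass.
active_fraction = active_params / total_params
# How many times fewer weights are touched than in a dense 25B model.
param_reduction = total_params / active_params

print(f"active fraction: {active_fraction:.0%}")       # active fraction: 12%
print(f"parameter reduction: {param_reduction:.1f}x")  # parameter reduction: 8.3x
```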

Maintenance & Community

The project is actively developed by the Moderation LLM Team at ByteDance, as evidenced by numerous recent publications at top AI conferences (ICLR, ICCV, NeurIPS). Contact: liuchang.lab@bytedance.com. No community channels (e.g., Discord, Slack) or public roadmap are specified.

Licensing & Compatibility

The README does not specify a software license. Open-source model weights are described as "under internal review," so they may be released later, but the terms of use are currently unknown.

Limitations & Caveats

Open-source model weights are currently under internal review and not fully released. Some versions (e.g., preview models) require switching to specific branches, which may indicate instability.

Health Check

Last Commit: 2 weeks ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 8
Star History: 182 stars in the last 30 days
