NextFlow by ByteVisionLab

Unified multimodal AI for generation and understanding

Created 2 months ago
290 stars

Top 91.1% on SourcePulse

View on GitHub
Project Summary

NextFlow addresses the fragmentation in multimodal AI by offering a unified decoder-only autoregressive transformer for understanding, generation, and editing. Targeting researchers and power users, it enables high-fidelity multimodal output and complex reasoning within a single, efficient architecture, eliminating the need for separate diffusion or LLM backbones.

How It Works

NextFlow employs a decoder-only transformer architecture, initialized from Qwen2.5-VL-7B and trained on 6 trillion interleaved text-image tokens. Its core innovations include a Unified Tokenizer, Scale Reweighting, and Self-Correction with Residual Features for stable training. A novel hierarchical prediction paradigm and reinforcement learning via Group Relative Policy Optimization (GRPO) enable efficient, high-quality generation and advanced capabilities such as Chain-of-Thought reasoning and in-context editing.
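GRPO (per the DeepSeekMath paper that introduced it) samples a group of generations per prompt and normalizes each reward against its group's mean and standard deviation, so no learned value network is needed. The sketch below shows that generic recipe in PyTorch; NextFlow's actual reward model, loss details, and hyperparameters are not public, so every value and function name here is illustrative only.

```python
# Generic GRPO-style update, not NextFlow's implementation.
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, samples_per_prompt). The advantage of each
    sampled generation is its z-score within its own group, so no
    critic/value network is required."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

def grpo_policy_loss(logprobs, old_logprobs, advantages, clip_eps=0.2):
    """PPO-style clipped surrogate applied per sampled generation."""
    ratio = torch.exp(logprobs - old_logprobs)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()

# Example: 2 prompts, 4 sampled generations each, scalar rewards per sample.
rewards = torch.tensor([[0.1, 0.7, 0.4, 0.9],
                        [0.2, 0.2, 0.8, 0.5]])
print(grpo_advantages(rewards))  # group-normalized advantages, shape (2, 4)
```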

Quick Start & Requirements

  • Prerequisites: Requires initialization from the Qwen2.5-VL-7B model; inference likely requires substantial GPU memory and CUDA support.
  • Setup: Specific installation and execution commands are not detailed; a hypothetical starting point is sketched after this list.
  • Resources: Training utilized 6 trillion tokens; inference efficiency is highlighted (1024x1024 in 5s, 6x fewer FLOPs than MMDiT).
  • Links: Papers available via arXiv (2601.02204, 2601.02256); a demo is mentioned but not linked.
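Since the README documents no setup, the following is only a hedged sketch of loading the Qwen2.5-VL-7B base checkpoint that NextFlow is initialized from, using Hugging Face transformers. The NextFlow weights, entry points, and generation API are assumptions not covered here.

```python
# Hypothetical starting point: loads only the Qwen2.5-VL-7B base model,
# not NextFlow itself (no public NextFlow weights are referenced in the README).
# Requires: pip install torch transformers accelerate, plus a CUDA GPU.
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # a 7B model in bf16 needs roughly 16 GB of VRAM
    device_map="auto",
)
```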

Highlighted Details

  • Performance: Achieves state-of-the-art scores on DPG (88.32) and ImgEdit (4.49) benchmarks, matching specialized diffusion models in quality.
  • Efficiency: Generates 1024x1024 images in 5 seconds and requires 6x fewer FLOPs than MMDiT-based diffusion models.
  • Capabilities: Supports native Chain-of-Thought reasoning, in-context editing, interleaved generation, and dynamic-resolution generation without re-encoding overhead (illustrated in the sketch after this list).
  • Benchmark: Introduces EditCanvas, a new benchmark for evaluating editing and subject-driven generation tasks.
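The README does not describe the tokenizer's internals, but the dynamic-resolution claim can be made concrete under one assumption: a patch-based visual tokenizer whose sequence length scales with the requested output size. The patch size below is a placeholder, not NextFlow's actual value.

```python
# Token-budget illustration for dynamic-resolution generation under an
# ASSUMED patch-based tokenizer; PATCH = 16 is a placeholder, not a
# documented NextFlow parameter.
PATCH = 16

def visual_token_count(height: int, width: int, patch: int = PATCH) -> int:
    """Tokens to represent an image natively, one token per patch x patch block."""
    assert height % patch == 0 and width % patch == 0
    return (height // patch) * (width // patch)

for h, w in [(512, 512), (1024, 768), (1024, 1024)]:
    print(f"{h}x{w}: {visual_token_count(h, w)} tokens")
# 512x512 -> 1024, 1024x768 -> 3072, 1024x1024 -> 4096: the autoregressive
# sequence length adapts to the target resolution instead of re-encoding
# to a fixed size.
```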

Maintenance & Community

No specific details regarding contributors, community channels (Discord, Slack), roadmap, or sponsorships are provided in the README.

Licensing & Compatibility

The README does not specify a software license. Compatibility for commercial use or closed-source linking is undetermined.

Limitations & Caveats

The README does not explicitly state limitations, alpha status, or known bugs. The arXiv identifiers (2601.*) date the papers to January 2026, suggesting the project may be future work or not yet publicly released in a stable form.

Health Check

Last Commit: 2 weeks ago
Responsiveness: Inactive
Pull Requests (30d): 1
Issues (30d): 5

Star History

290 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Omar Sanseviero (DevRel at Google DeepMind).

RPG-DiffusionMaster by YangLing0818

2k stars · 0%
Training-free paradigm for text-to-image generation/editing
Created 2 years ago · Updated 1 year ago
Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Chaoyu Yang (founder of Bento), and 12 more.

IF by deep-floyd

8k stars · 0.0%
Text-to-image model for photorealistic synthesis and language understanding
Created 3 years ago · Updated 1 year ago