NextStep-1 by stepfun-ai

Autoregressive image generation with continuous tokens

Created 6 months ago
621 stars

Top 53.1% on SourcePulse

Project Summary

NextStep-1 addresses the limitations of traditional autoregressive image generation by employing continuous image tokens, preserving visual richness without relying on costly diffusion models or lossy discrete tokens. Developed for researchers and practitioners in multimodal AI, it offers a scalable and simpler framework for state-of-the-art image generation.

How It Works

This project introduces a 14B-parameter autoregressive model that jointly processes discrete text tokens and continuous image tokens. It utilizes a standard language model head for text and a lightweight 157M-parameter flow matching head for visual generation. This unified next-token prediction approach is designed for simplicity and scalability, enabling the generation of highly detailed images by directly modeling continuous visual data.
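The training objective for the visual side can be illustrated with a toy flow matching step. Everything below is a minimal numpy sketch, not the project's API: the tiny `flow_head`, the dimensions, and the linear interpolant are assumptions standing in for the 157M head conditioned on 14B-backbone hidden states.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes -- the real model pairs a 14B transformer
# backbone with a 157M flow matching head.
HIDDEN = 8      # backbone hidden state per image-token position
TOKEN_DIM = 4   # continuous image-token dimension

# Toy linear "flow matching head": predicts a velocity from the
# backbone hidden state, the noisy token, and the timestep.
W = rng.normal(scale=0.1, size=(HIDDEN + TOKEN_DIM + 1, TOKEN_DIM))

def flow_head(hidden, x_t, t):
    inp = np.concatenate([hidden, x_t, [t]])
    return inp @ W

def flow_matching_loss(hidden, x1):
    """Rectified-flow-style target: regress the velocity (x1 - x0)
    along the straight path from noise x0 to the data token x1."""
    x0 = rng.normal(size=TOKEN_DIM)   # sampled Gaussian noise
    t = rng.uniform()                 # random timestep in (0, 1)
    x_t = (1.0 - t) * x0 + t * x1     # linear interpolant
    v_target = x1 - x0                # constant velocity target
    v_pred = flow_head(hidden, x_t, t)
    return float(np.mean((v_pred - v_target) ** 2))

hidden = rng.normal(size=HIDDEN)   # stand-in for a backbone hidden state
x1 = rng.normal(size=TOKEN_DIM)    # stand-in for a continuous image token
loss = flow_matching_loss(hidden, x1)
```

Text positions keep the ordinary cross-entropy language-model head; only the image positions route through a loss of this shape.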

Quick Start & Requirements

Installation involves cloning the repository, creating a Conda environment with Python 3.10, and installing dependencies with uv pip install -e .; pre-installing PyTorch matching your CUDA version is recommended. The project provides CLI tools such as smartrun for distributed training and inference/inference.py for running models. Downloading model weights and datasets can be time-consuming. Links to the project page, Hugging Face, and arXiv are available.
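The steps above can be sketched as a shell session. The repository URL, environment name, and PyTorch index URL are assumptions inferred from the project/org names; adjust the index URL to your CUDA version.

```shell
# Clone the repository (URL assumed from the GitHub org and project name)
git clone https://github.com/stepfun-ai/NextStep-1.git
cd NextStep-1

# Create and activate a Conda environment with Python 3.10
conda create -n nextstep python=3.10 -y
conda activate nextstep

# Recommended: pre-install PyTorch for your CUDA version
# (example index URL is for CUDA 12.1; change to match your setup)
pip install torch --index-url https://download.pytorch.org/whl/cu121

# Install the project and its dependencies in editable mode
uv pip install -e .
```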

Highlighted Details

  • Accepted for an Oral Presentation at ICLR 2026.
  • NextStep-1.1 introduces enhanced output quality through extended training and a Flow-based Reinforcement Learning (RL) post-training paradigm.
  • Models continuous image tokens directly, preserving full visual data richness, unlike VQ-based methods.
  • Features a 14B-parameter model architecture with a specialized, lightweight flow matching head for visual processing.
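At inference time, sampling a continuous token from a flow matching head amounts to integrating a learned velocity field from noise (t = 0) to data (t = 1). The numpy sketch below is a toy illustration, not the project's sampler: it substitutes the exact conditional velocity toward a single known target for the trained head, so plain Euler steps land on the target.

```python
import numpy as np

rng = np.random.default_rng(1)
TOKEN_DIM = 4   # hypothetical continuous image-token dimension
N_STEPS = 20    # Euler integration steps

target = np.array([0.5, -1.0, 2.0, 0.0])  # toy "image token" to reach

def velocity(x, t):
    # Stand-in for the trained flow matching head: for a single known
    # target this is the exact velocity of the linear noise-to-data path.
    return (target - x) / (1.0 - t)

x = rng.normal(size=TOKEN_DIM)  # start from Gaussian noise at t = 0
dt = 1.0 / N_STEPS
for i in range(N_STEPS):
    t = i * dt
    x = x + dt * velocity(x, t)  # Euler step toward t = 1
```

With a real trained head, the velocity would instead be predicted from the backbone hidden state at each step, and the integrated `x` becomes the next continuous image token in the autoregressive sequence.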

Maintenance & Community

The project is developed by StepFun’s Multimodal Intelligence team, with recent releases of training code and post-training blogs in February 2026. A WeChat group is available for community engagement. Checkpoints are hosted on Hugging Face and ModelScope.

Licensing & Compatibility

NextStep is licensed under the Apache License 2.0, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

The primary training datasets used by the NextStep team (approximately 1 billion images) are proprietary and not open-sourced; users are strongly advised to collect and prepare their own large-scale datasets. Older NextStep-1 series models are noted as less performant than the NextStep-1.1 series and are not recommended for use.

Health Check

Last Commit: 3 days ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 1
Star History: 29 stars in the last 30 days

Explore Similar Projects

Starred by Alex Yu (Research Scientist at OpenAI; Cofounder of Luma AI) and Phil Wang (Prolific Research Paper Implementer).

Cosmos-Tokenizer by NVIDIA

0.2% · 2k stars
Suite of neural tokenizers for image and video processing
Created 1 year ago · Updated 1 year ago