NextStep-1 by stepfun-ai

Autoregressive image generation with continuous tokens

Created 6 months ago
621 stars

Top 53.1% on SourcePulse

Project Summary

NextStep-1 addresses the limitations of traditional autoregressive image generation by employing continuous image tokens, preserving visual richness without relying on costly diffusion models or lossy discrete tokens. Developed for researchers and practitioners in multimodal AI, it offers a scalable and simpler framework for state-of-the-art image generation.

How It Works

This project introduces a 14B-parameter autoregressive model that jointly processes discrete text tokens and continuous image tokens. It utilizes a standard language model head for text and a lightweight 157M-parameter flow matching head for visual generation. This unified next-token prediction approach is designed for simplicity and scalability, enabling the generation of highly detailed images by directly modeling continuous visual data.
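The training objective for the visual side can be illustrated with a toy flow matching step. Everything below is a minimal numpy sketch, not the project's API: the tiny `flow_head`, the dimensions, and the linear interpolant are assumptions standing in for the 157M head conditioned on 14B-backbone hidden states.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes -- the real model pairs a 14B transformer
# backbone with a 157M flow matching head.
HIDDEN = 8      # backbone hidden state per image-token position
TOKEN_DIM = 4   # continuous image-token dimension

# Toy linear "flow matching head": predicts a velocity from the
# backbone hidden state, the noisy token, and the timestep.
W = rng.normal(scale=0.1, size=(HIDDEN + TOKEN_DIM + 1, TOKEN_DIM))

def flow_head(hidden, x_t, t):
    inp = np.concatenate([hidden, x_t, [t]])
    return inp @ W

def flow_matching_loss(hidden, x1):
    """Rectified-flow-style target: regress the velocity (x1 - x0)
    along the straight path from noise x0 to the data token x1."""
    x0 = rng.normal(size=TOKEN_DIM)   # sampled Gaussian noise
    t = rng.uniform()                 # random timestep in (0, 1)
    x_t = (1.0 - t) * x0 + t * x1     # linear interpolant
    v_target = x1 - x0                # constant velocity target
    v_pred = flow_head(hidden, x_t, t)
    return float(np.mean((v_pred - v_target) ** 2))

hidden = rng.normal(size=HIDDEN)   # stand-in for a backbone hidden state
x1 = rng.normal(size=TOKEN_DIM)    # stand-in for a continuous image token
loss = flow_matching_loss(hidden, x1)
```

Text positions keep the ordinary cross-entropy language-model head; only the image positions route through a loss of this shape.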

Quick Start & Requirements

Installation involves cloning the repository, creating a Conda environment with Python 3.10, and installing dependencies with uv pip install -e .; pre-installing PyTorch matching your CUDA version is recommended. The project provides CLI tools such as smartrun for distributed training and inference/inference.py for running models. Downloading model weights and datasets can be time-consuming. Links to the project page, Hugging Face, and arXiv are available.
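The steps above can be sketched as a shell session. The repository URL, environment name, and PyTorch index URL are assumptions inferred from the project/org names; adjust the index URL to your CUDA version.

```shell
# Clone the repository (URL assumed from the GitHub org and project name)
git clone https://github.com/stepfun-ai/NextStep-1.git
cd NextStep-1

# Create and activate a Conda environment with Python 3.10
conda create -n nextstep python=3.10 -y
conda activate nextstep

# Recommended: pre-install PyTorch for your CUDA version
# (example index URL is for CUDA 12.1; change to match your setup)
pip install torch --index-url https://download.pytorch.org/whl/cu121

# Install the project and its dependencies in editable mode
uv pip install -e .
```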

Highlighted Details

  • Accepted for an Oral Presentation at ICLR 2026.
  • NextStep-1.1 introduces enhanced output quality through extended training and a Flow-based Reinforcement Learning (RL) post-training paradigm.
  • Models continuous image tokens directly, preserving full visual data richness, unlike VQ-based methods.
  • Features a 14B-parameter model architecture with a specialized, lightweight flow matching head for visual processing.
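At inference time, sampling a continuous token from a flow matching head amounts to integrating a learned velocity field from noise (t = 0) to data (t = 1). The numpy sketch below is a toy illustration, not the project's sampler: it substitutes the exact conditional velocity toward a single known target for the trained head, so plain Euler steps land on the target.

```python
import numpy as np

rng = np.random.default_rng(1)
TOKEN_DIM = 4   # hypothetical continuous image-token dimension
N_STEPS = 20    # Euler integration steps

target = np.array([0.5, -1.0, 2.0, 0.0])  # toy "image token" to reach

def velocity(x, t):
    # Stand-in for the trained flow matching head: for a single known
    # target this is the exact velocity of the linear noise-to-data path.
    return (target - x) / (1.0 - t)

x = rng.normal(size=TOKEN_DIM)  # start from Gaussian noise at t = 0
dt = 1.0 / N_STEPS
for i in range(N_STEPS):
    t = i * dt
    x = x + dt * velocity(x, t)  # Euler step toward t = 1
```

With a real trained head, the velocity would instead be predicted from the backbone hidden state at each step, and the integrated `x` becomes the next continuous image token in the autoregressive sequence.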

Maintenance & Community

The project is developed by StepFun’s Multimodal Intelligence team, with recent releases of training code and post-training blogs in February 2026. A WeChat group is available for community engagement. Checkpoints are hosted on Hugging Face and ModelScope.

Licensing & Compatibility

NextStep is licensed under the Apache License 2.0, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

The primary training datasets used by the NextStep team (approximately 1 billion images) are proprietary and not open-sourced; users are strongly advised to collect and prepare their own large-scale datasets. Older NextStep-1 series models are noted as less performant than the NextStep-1.1 series and are not recommended for use.

Health Check

Last Commit: 3 days ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 1
Star History: 29 stars in the last 30 days

Explore Similar Projects

Starred by Alex Yu (Research Scientist at OpenAI; Cofounder of Luma AI) and Phil Wang (Prolific Research Paper Implementer).

Cosmos-Tokenizer by NVIDIA

0.2% · 2k stars
Suite of neural tokenizers for image and video processing
Created 1 year ago · Updated 1 year ago