ERNIE-Image by baidu

Advanced text-to-image generation model

Created 2 weeks ago

388 stars

Top 73.6% on SourcePulse

Project Summary

Summary

ERNIE-Image is an open-weight text-to-image generation model from Baidu. It reports state-of-the-art performance among open-weight models using a compact 8B-parameter Diffusion Transformer (DiT). It targets researchers and developers who want high-quality image synthesis on consumer hardware, and its strengths are text-heavy visuals, complex instruction following, and structured content generation.

How It Works

The core architecture is a single-stream Diffusion Transformer (DiT) with 8 billion parameters. A lightweight Prompt Enhancer (PE) sits in front of it and expands brief user inputs into richer, structured descriptions. Together, the two components allow ERNIE-Image to rival larger models, particularly for precise text rendering and intricate scene composition.
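The two-stage flow described above can be sketched roughly as follows. This is an illustration only: `generate`, `toy_enhancer`, and `toy_renderer` are hypothetical names, and the stub bodies merely stand in for the real Prompt Enhancer and DiT, whose interfaces this summary does not describe.

```python
from typing import Callable, Optional


def generate(
    prompt: str,
    render: Callable[[str], str],
    enhance_prompt: Optional[Callable[[str], str]] = None,
) -> str:
    """Optionally expand the prompt, then hand it to the image generator."""
    # With no enhancer, the raw prompt goes straight to the generator
    # (the "w/o PE" configuration); otherwise it is expanded first ("w/ PE").
    final_prompt = enhance_prompt(prompt) if enhance_prompt is not None else prompt
    return render(final_prompt)


def toy_enhancer(prompt: str) -> str:
    # Hypothetical stand-in: a real enhancer would be a language model.
    return f"{prompt}; detailed layout, legible text, consistent lighting"


def toy_renderer(prompt: str) -> str:
    # Hypothetical stand-in: a real renderer would return an image, not a string.
    return f"<image for: {prompt}>"


without_pe = generate("a concert poster", toy_renderer)
with_pe = generate("a concert poster", toy_renderer, toy_enhancer)
```

Keeping the enhancer optional mirrors the with-PE and without-PE configurations that the benchmark comparison distinguishes.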

Quick Start & Requirements

  • Primary install via Hugging Face diffusers: first run pip install git+https://github.com/huggingface/diffusers, then run pip install -e . in the cloned repo.
  • Prerequisites: a CUDA-capable GPU; load the model with torch_dtype=torch.bfloat16.
  • Hardware: Consumer GPUs with 24 GB VRAM.
  • Links: Hugging Face Demo, AI Studio Demo, Blog, Discord, X.
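As a rough sanity check on the 24 GB hardware figure, the arithmetic below estimates the memory taken by the DiT weights alone in bfloat16 (2 bytes per parameter). It is a back-of-the-envelope sketch: `weight_memory_gb` is a hypothetical helper, and the estimate deliberately ignores the text encoder, Prompt Enhancer, VAE, activations, and framework overhead, none of which this summary quantifies.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# 8 billion DiT parameters stored as bfloat16 (2 bytes each).
dit_weights_gb = weight_memory_gb(8e9)

# What is left of a nominal 24 GB budget for everything else.
headroom_gb = 24 - dit_weights_gb

print(f"DiT weights: {dit_weights_gb:.1f} GB, remaining budget: {headroom_gb:.1f} GB")
```

Under these assumptions the weights occupy about 16 GB, which leaves roughly 8 GB of a 24 GB card for the remaining components and activations.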

Highlighted Details

  • Compact Scale: Reports state-of-the-art performance among open-weight models while using only 8B DiT parameters, outperforming larger models.
  • Text Rendering: Excels in dense, long-form, layout-sensitive text for posters, infographics, and UI elements.
  • Instruction Following: Reliably handles complex prompts with multiple objects, detailed relationships, and knowledge-intensive descriptions.
  • Structured Generation: Effective for comics, storyboards, and multi-panel compositions.
  • Deployment: Practical on consumer GPUs with 24 GB VRAM.
  • Versions: ERNIE-Image (50 steps, CFG 4.0) and ERNIE-Image-Turbo (8 steps, CFG 1.0), the latter for faster generation.
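The two variants' sampling settings can be kept as presets, as in the minimal sketch below. The step counts and CFG values come from the list above; `SamplingPreset`, `PRESETS`, and `sampling_kwargs` are hypothetical names, and the keyword names `num_inference_steps` and `guidance_scale` follow the convention used by many diffusers pipelines. Whether the ERNIE-Image pipeline accepts exactly these keywords is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplingPreset:
    num_inference_steps: int  # number of denoising steps
    guidance_scale: float     # classifier-free guidance (CFG) scale


# Values taken from the "Versions" entry; the dictionary keys are illustrative labels.
PRESETS = {
    "ERNIE-Image": SamplingPreset(num_inference_steps=50, guidance_scale=4.0),
    "ERNIE-Image-Turbo": SamplingPreset(num_inference_steps=8, guidance_scale=1.0),
}


def sampling_kwargs(variant: str) -> dict:
    """Return keyword arguments for a pipeline call (keyword names are assumed)."""
    preset = PRESETS[variant]
    return {
        "num_inference_steps": preset.num_inference_steps,
        "guidance_scale": preset.guidance_scale,
    }


# Ratio of denoising steps between the two variants (not a wall-clock measurement).
step_ratio = (
    PRESETS["ERNIE-Image"].num_inference_steps
    / PRESETS["ERNIE-Image-Turbo"].num_inference_steps
)
```

The Turbo preset uses 6.25 times fewer denoising steps. Actual speedup depends on the pipeline and hardware and is not stated in this summary.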

Maintenance & Community

  • Community channels: WeChat, Discord, X.
  • Contact: wenxin-all@baidu.com.
  • Ecosystem: ComfyUI integration is supported, and GGUF weights are available via Unsloth.

Licensing & Compatibility

  • Licensed under the Apache License 2.0.
  • This permissive license is generally compatible with commercial use and closed-source linking.

Limitations & Caveats

The README does not explicitly detail limitations, alpha status, or known bugs. Its reported metrics differ between ERNIE-Image (w/ PE) and ERNIE-Image (w/o PE), which indicates that the Prompt Enhancer has a significant impact on certain benchmarks.

Health Check

  • Last commit: 1 week ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 7
  • Star history: 392 stars in the last 14 days