SFHQ-dataset by SelfishGene

High-resolution synthetic face dataset for generative AI research

Created 3 years ago

254 stars

Top 99.1% on SourcePulse

Project Summary

The SFHQ dataset offers approximately 425,000 curated, high-quality synthetic face images at 1024x1024 resolution. It addresses the need for large-scale facial data that is free of privacy concerns, for training machine learning models or augmenting existing datasets, and it provides substantial variability in identity, ethnicity, age, pose, expression, and lighting.

How It Works

Inspiration images (paintings, 3D models, text-to-image outputs) are encoded into StyleGAN2 latent space via the e4e encoder. The resulting latent codes are then manipulated to generate photorealistic faces. A semi-automatic curation process, using a "visual taste approximator" and CLIP features, filters for quality and removes near-duplicates so that the remaining image pairs have a CLIP similarity below roughly 0.92, yielding a large, diverse dataset.
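The near-duplicate filtering step can be illustrated with a minimal sketch. It assumes CLIP image features are available as rows of a NumPy array and applies a simple greedy rule with cosine similarity; the function name filter_near_duplicates, the toy vectors, and the greedy strategy are illustrative assumptions, not the author's actual curation code.

```python
import numpy as np


def filter_near_duplicates(features: np.ndarray, threshold: float = 0.92) -> list[int]:
    """Return indices of rows to keep so that no kept pair exceeds `threshold`.

    Hypothetical helper: a greedy pass, not the dataset author's procedure.
    """
    # L2-normalise each row so that a dot product equals cosine similarity.
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    kept: list[int] = []
    for i in range(len(unit)):
        # Keep row i only if it is not too similar to any row already kept.
        if not kept or np.max(unit[kept] @ unit[i]) <= threshold:
            kept.append(i)
    return kept


# Toy stand-in for CLIP image features: rows 0 and 1 point in almost the same direction.
demo = np.array([[1.0, 0.0, 0.0], [0.99, 0.05, 0.0], [0.0, 1.0, 0.0]])
kept_indices = filter_near_duplicates(demo)
print(kept_indices)  # [0, 2]
```

A greedy pass like this compares each candidate against every kept row, so it scales quadratically; at the scale of hundreds of thousands of images, an approximate nearest-neighbour index would likely be needed instead.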

Quick Start & Requirements

The dataset can be downloaded from Kaggle. The README does not list requirements explicitly; implied dependencies include StyleGAN2, the e4e encoder, CLIP, Face Parsing BiSeNet, and Dlib. An example script (explore_dataset.py) and a live Kaggle notebook demonstrate how to access features such as landmarks and segmentation maps and how to perform textual searches.
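As a rough illustration of the textual search idea, the sketch below ranks images by cosine similarity between a text feature vector and precomputed image feature vectors. It uses random arrays in place of real CLIP features, and the name top_k_matches and the 512-dimensional feature size are assumptions; it does not reproduce explore_dataset.py or the dataset's actual file layout.

```python
import numpy as np


def top_k_matches(text_feature: np.ndarray, image_features: np.ndarray, k: int = 3) -> np.ndarray:
    """Indices of the k image rows most similar to the text feature, best first."""
    # Normalise both sides so the dot product is a cosine similarity.
    text_unit = text_feature / np.linalg.norm(text_feature)
    image_unit = image_features / np.linalg.norm(image_features, axis=1, keepdims=True)
    scores = image_unit @ text_unit
    # Sort by descending score and take the first k indices.
    return np.argsort(-scores)[:k]


# Random stand-ins for precomputed CLIP features (assumed shape: n_images x 512).
rng = np.random.default_rng(0)
image_features = rng.normal(size=(100, 512))
# Hypothetical query vector: a lightly perturbed copy of image 42's feature.
text_feature = image_features[42] + 0.1 * rng.normal(size=512)

matches = top_k_matches(text_feature, image_features)
print(matches)
```

With real data, text_feature would come from encoding a prompt with the same CLIP model that produced the image features; here the top match is simply the row the query was derived from.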

Highlighted Details

~425,000 curated 1024x1024 synthetic face images.
Extended set of 110 facial landmark points, including the hairline.
Semantic segmentation maps from Face Parsing BiSeNet.
CLIP image/text feature vectors for textual querying.
Near-duplicate images removed (remaining pairs have CLIP similarity below ~0.92).
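One way the segmentation maps might be used is to measure how much of each image a given class covers, for example to filter images by the visible area of a region. The sketch below assumes a map is an integer array of per-pixel class ids; the helper class_fractions, the toy 4x4 map, and the three class ids are hypothetical and do not reflect the actual BiSeNet label scheme.

```python
import numpy as np


def class_fractions(seg_map: np.ndarray, num_classes: int) -> np.ndarray:
    """Fraction of pixels assigned to each class id in an integer segmentation map."""
    # bincount tallies how many pixels carry each non-negative class id.
    counts = np.bincount(seg_map.ravel(), minlength=num_classes)
    return counts / seg_map.size


# Toy 4x4 map with three made-up class ids (0, 1, 2).
seg = np.zeros((4, 4), dtype=np.int64)
seg[:2, :] = 1   # top two rows belong to class 1
seg[3, 3] = 2    # a single pixel of class 2
fractions = class_fractions(seg, num_classes=3)
print(fractions)  # [0.4375 0.5 0.0625]
```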

Maintenance & Community

Created by David Beniaguev, with the GitHub repository (SelfishGene/SFHQ-dataset) as the primary resource. No specific community channels or maintenance details are provided in the README.

Licensing & Compatibility

The README describes the dataset as having "no privacy issues or license issues" because it is synthetically generated. However, no specific open-source license is stated, so the terms for commercial use would need to be clarified.

Limitations & Caveats

Limited variability in accessories (hats, earphones) and jewelry; minimal occlusions beyond hair self-occlusion. Inherits biases from source datasets (FFHQ, AAHQ) and generative models (StyleGAN2, Stable Diffusion). A newer dataset, SFHQ-T2I, is also mentioned.

Health Check

Last Commit

1 year ago

Responsiveness

Inactive

Star History

2 stars in the last 30 days