scGPT aims to build a foundation model for single-cell multi-omics analysis using generative AI. It provides pre-trained models and tools for tasks like cell embedding, annotation, and reference mapping, targeting researchers and bioinformaticians working with large-scale single-cell datasets.
How It Works
scGPT leverages a generative transformer architecture, similar to large language models, to learn representations from single-cell data. It processes gene expression profiles as token sequences, enabling various downstream tasks through fine-tuning or zero-shot learning. The model's design allows efficient handling of large datasets and flexible integration with existing bioinformatics tools.
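A minimal sketch of the zero-shot embedding workflow is shown below. The helper name, its arguments, and its return type are assumptions for illustration (modeled loosely on the scGPT tutorials); check the repository's documentation for the exact interface.

```python
# Sketch only: obtain zero-shot cell embeddings from a pre-trained checkpoint.
# The function name and arguments below are assumptions, not the confirmed API.
import scanpy as sc
import scgpt

adata = sc.read_h5ad("my_cells.h5ad")  # cell-by-gene expression matrix (AnnData)

# Hypothetical call: encode each cell's expression profile as a token sequence
# and return one fixed-length embedding per cell (assumed to be a NumPy array).
embeddings = scgpt.tasks.embed_data(
    adata,
    model_dir="path/to/whole-human-checkpoint",  # downloaded pre-trained weights
    gene_col="gene_name",                        # adata.var column holding gene symbols
    batch_size=64,
)

adata.obsm["X_scGPT"] = embeddings        # store embeddings for downstream analysis
sc.pp.neighbors(adata, use_rep="X_scGPT") # standard scanpy clustering/annotation steps
sc.tl.umap(adata)
```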
Quick Start & Requirements
- Install via pip: `pip install scgpt "flash-attn<1.0.5"` (or `pip install scgpt "flash-attn<1.0.5" "orbax<0.1.8"` if encountering orbax issues).
- Recommended: Python >= 3.7.13, R >= 3.6.1.
- Optional: `pip install wandb` for logging.
- The flash-attention dependency requires specific GPU and CUDA versions (CUDA 11.7 and flash-attn<1.0.5 recommended as of May 2023); a quick environment check is sketched after this list.
- Pre-trained checkpoints are available for download, with the `whole-human` model recommended.
- Tutorials and online apps are available for reference mapping, cell annotation, and gene regulatory network (GRN) inference.
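The following convenience sketch (not part of scGPT itself) checks whether a CUDA-capable GPU is visible and whether flash-attn is importable; since flash-attention is optional, scGPT can still be used on CPU when it is absent.

```python
# Environment check before installing/running scGPT (illustrative only).
import importlib.util

import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    print("CUDA runtime used by PyTorch:", torch.version.cuda)  # 11.7 recommended for flash-attn<1.0.5

has_flash_attn = importlib.util.find_spec("flash_attn") is not None
print("flash-attn installed:", has_flash_attn)
if not has_flash_attn:
    print("flash-attn not found; models can still be loaded on CPU.")
```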
Highlighted Details
- Pre-trained on over 33 million human cells (`whole-human` model).
- Supports zero-shot cell embedding and reference mapping across millions of cells efficiently (e.g., an index over 33M cells occupies < 1 GB and a query completes in < 1 s on GPU); a similarity-search sketch follows this list.
- Online apps available for browser-based interaction.
- Flash-attention is now an optional dependency, so models can be loaded on CPU-only machines.
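Reference mapping at this scale amounts to nearest-neighbour search over cell embeddings. The sketch below uses FAISS with a product-quantized index; the index type and parameters are assumptions chosen to illustrate how a multi-million-cell index can stay compact, not necessarily the configuration used by scGPT's own apps.

```python
# Illustrative reference mapping via approximate nearest-neighbour search.
# Index type and parameters are assumptions, not scGPT's confirmed setup.
import faiss
import numpy as np

d = 512                                                    # embedding dimension (model-dependent)
ref_emb = np.random.rand(100_000, d).astype("float32")     # reference-atlas cell embeddings
query_emb = np.random.rand(1_000, d).astype("float32")     # embeddings of new query cells

# Product quantization compresses vectors so large indexes stay small
# (this is how a multi-million-cell index can remain under ~1 GB).
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, 1024, 64, 8)        # nlist=1024, 64 sub-quantizers, 8 bits
index.train(ref_emb)
index.add(ref_emb)

index.nprobe = 16                                          # clusters visited per query
distances, neighbors = index.search(query_emb, k=10)       # 10 nearest reference cells per query

# neighbors[i] indexes reference cells whose labels can be transferred to query cell i.
```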
Maintenance & Community
- Active development with recent updates (Feb 2024) including preliminary HuggingFace integration.
- Tutorials for zero-shot applications and continual pre-trained models are available.
- Contributions are welcomed via pull requests.
Licensing & Compatibility
- License details are not explicitly stated in the README.
- Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
- The README does not explicitly state the license, which is crucial for commercial adoption.
- Flash-attention installation can be complex and requires specific hardware/software configurations.
- Some features, like pretraining code with generative attention masking and HuggingFace integration, are still under development or in preliminary stages.