scGPT by bowang-lab

Foundation model for single-cell multi-omics research

Created 2 years ago
1,322 stars

Top 30.3% on SourcePulse

View on GitHub
Project Summary

scGPT aims to build a foundation model for single-cell multi-omics analysis using generative AI. It provides pre-trained models and tools for tasks like cell embedding, annotation, and reference mapping, targeting researchers and bioinformaticians working with large-scale single-cell datasets.

How It Works

scGPT leverages a generative transformer architecture, similar to large language models, to learn representations from single-cell data. It processes gene expression profiles as sequences, enabling it to perform various downstream tasks through fine-tuning or zero-shot learning. The model's design allows for efficient handling of large datasets and supports flexible integration with existing bioinformatics tools.
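
To make the sequence framing concrete, here is a minimal sketch of how a gene expression profile can be turned into aligned gene-token and value-token sequences. The vocabulary, binning scheme, and special tokens are illustrative assumptions, not scGPT's actual preprocessing code.

    # Illustrative sketch only: framing a gene-expression profile as a token
    # sequence for a transformer, in the spirit of scGPT's input scheme.
    # The vocabulary, binning, and special tokens here are assumptions.
    import numpy as np

    gene_vocab = {"<cls>": 0, "<pad>": 1, "CD3D": 2, "MS4A1": 3, "LYZ": 4, "NKG7": 5}

    def profile_to_tokens(expr: dict, n_bins: int = 5):
        """Map a {gene: raw count} profile to (gene token ids, binned value ids)."""
        genes = [g for g in expr if g in gene_vocab and expr[g] > 0]
        counts = np.array([expr[g] for g in genes], dtype=float)
        # Rank-style value binning: continuous expression becomes a small
        # discrete vocabulary the model can attend over.
        edges = np.quantile(counts, np.linspace(0, 1, n_bins + 1)[1:-1])
        value_ids = np.digitize(counts, edges) + 1        # 1..n_bins; 0 reserved
        gene_ids = [gene_vocab["<cls>"]] + [gene_vocab[g] for g in genes]
        return gene_ids, [0] + value_ids.tolist()          # <cls> carries no value

    gene_ids, value_ids = profile_to_tokens({"CD3D": 12, "LYZ": 3, "NKG7": 7})
    print(gene_ids, value_ids)   # two aligned sequences fed to the transformer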

Quick Start & Requirements

  • Install via pip: pip install scgpt "flash-attn<1.0.5" (or pip install scgpt "flash-attn<1.0.5" "orbax<0.1.8" if encountering orbax issues).
  • Recommended: Python >= 3.7.13, R >= 3.6.1.
  • Optional: pip install wandb for logging.
  • The flash-attn dependency requires specific GPU and CUDA versions; as of May 2023 the project recommends CUDA 11.7 with flash-attn<1.0.5.
  • Pre-trained checkpoints are available for download, with the whole-human model recommended as the default starting point (see the embedding sketch after this list).
  • Tutorials and online apps are available for reference mapping, cell annotation, and GRN inference.
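
For orientation, the sketch below strings these pieces together for zero-shot cell embedding. It assumes the embed_data helper shown in the project's zero-shot tutorials and a locally downloaded whole-human checkpoint directory; argument names and the output key may differ across versions, so treat it as a starting point rather than a verbatim recipe.

    # Sketch of zero-shot cell embedding, modeled on the project's zero-shot
    # tutorials. embed_data and its arguments are assumptions drawn from those
    # tutorials and may differ in your installed version.
    import scanpy as sc
    from scgpt.tasks import embed_data

    # Hypothetical input file; any AnnData with counts and gene symbols works.
    adata = sc.read_h5ad("my_cells.h5ad")

    adata = embed_data(
        adata,
        model_dir="path/to/scGPT_human",   # downloaded whole-human checkpoint
        gene_col="gene_name",              # adata.var column with gene symbols (assumption)
    )

    # The tutorials store embeddings in adata.obsm (e.g. "X_scGPT"; verify the
    # key in your version), ready for clustering, annotation, or mapping.
    sc.pp.neighbors(adata, use_rep="X_scGPT")
    sc.tl.umap(adata)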

Highlighted Details

  • Pre-trained on over 33 million human cells (whole-human model).
  • Supports zero-shot cell embedding and reference mapping at scale: an index over 33M cells fits in under 1 GB, and a similarity search returns in under 1 s on GPU (see the indexing sketch after this list).
  • Online apps available for browser-based interaction.
  • Flash-attention is now an optional dependency, so models can also be loaded and run on CPU.
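
The sub-second search figure reflects nearest-neighbor indexing over precomputed cell embeddings. The sketch below shows the general pattern with FAISS; the index type, embedding dimension, and random stand-in data are assumptions for illustration, not the project's exact configuration.

    # General pattern for reference mapping over precomputed cell embeddings:
    # build a similarity index once, then query new cells against it.
    # FAISS usage is illustrative; the index type and dim=512 are assumptions.
    import numpy as np
    import faiss

    dim = 512
    reference = np.random.rand(100_000, dim).astype("float32")  # stand-in embeddings
    faiss.normalize_L2(reference)                               # cosine via inner product

    index = faiss.IndexFlatIP(dim)   # exact search; IVF/PQ variants trade accuracy
    index.add(reference)             # for the memory footprint quoted above

    query = np.random.rand(8, dim).astype("float32")            # new cells to map
    faiss.normalize_L2(query)
    scores, neighbors = index.search(query, 10)                 # top-10 matches each
    print(neighbors.shape)           # (8, 10): nearest reference cells per query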

Maintenance & Community

  • Active development, with recent updates (as of Feb 2024) including preliminary Hugging Face integration.
  • Tutorials for zero-shot applications and continual pre-trained models are available.
  • Contributions are welcomed via pull requests.

Licensing & Compatibility

  • License details are not explicitly stated in the README.
  • Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

  • The README does not explicitly state a license, a key consideration for commercial adoption.
  • Flash-attention installation can be complex and requires specific hardware/software configurations.
  • Some features, like pretraining code with generative attention masking and HuggingFace integration, are still under development or in preliminary stages.

Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 4
  • Issues (30d): 5

Star History

  • 21 stars in the last 30 days

Explore Similar Projects

evo by evo-design

  • Top 0.3% on SourcePulse, 1k stars
  • DNA foundation model for long-context biological sequence modeling and design
  • Created 1 year ago, updated 1 day ago
  • Starred by Chip Huyen (author of "AI Engineering", "Designing Machine Learning Systems"), Vincent Weisser (cofounder of Prime Intellect), and 2 more.