scGPT by bowang-lab

Foundation model for single-cell multi-omics research

created 2 years ago
1,289 stars

Top 31.6% on sourcepulse

Project Summary

scGPT aims to build a foundation model for single-cell multi-omics analysis using generative AI. It provides pre-trained models and tools for tasks like cell embedding, annotation, and reference mapping, targeting researchers and bioinformaticians working with large-scale single-cell datasets.

How It Works

scGPT leverages a generative transformer architecture, similar to large language models, to learn representations from single-cell data. It processes gene expression profiles as sequences, enabling it to perform various downstream tasks through fine-tuning or zero-shot learning. The model's design allows for efficient handling of large datasets and supports flexible integration with existing bioinformatics tools.
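
As a concrete illustration, here is a minimal sketch of zero-shot cell embedding. The embed_data helper, its arguments, and the X_scGPT embedding key are reconstructed from the project's zero-shot tutorials; treat the names and paths as assumptions to verify against those tutorials.

    import scanpy as sc
    from scgpt.tasks import embed_data  # helper shown in the zero-shot tutorials (assumed import path)

    # Query dataset: an AnnData object with raw counts and gene symbols.
    adata = sc.read_h5ad("my_dataset.h5ad")  # placeholder path

    # Embed cells with a downloaded pre-trained checkpoint (whole-human recommended).
    adata = embed_data(
        adata,
        model_dir="checkpoints/scGPT_human",  # placeholder: unzipped checkpoint directory
        gene_col="gene_name",                 # assumed column in adata.var holding gene symbols
        batch_size=64,
    )

    # Embeddings land in adata.obsm (key name assumed from the tutorials) and
    # feed standard scanpy workflows such as neighbors/UMAP/clustering.
    sc.pp.neighbors(adata, use_rep="X_scGPT")
    sc.tl.umap(adata)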

Quick Start & Requirements

  • Install via pip: pip install scgpt "flash-attn<1.0.5" (add "orbax<0.1.8" to the same command if you run into orbax compatibility issues).
  • Recommended: Python >= 3.7.13, R >= 3.6.1.
  • Optional: pip install wandb for logging.
  • The flash-attn dependency needs a compatible GPU and CUDA version (CUDA 11.7 with flash-attn<1.0.5 recommended as of May 2023); a quick environment check is sketched after this list.
  • Pre-trained checkpoints are available for download; the whole-human model is the recommended starting point.
  • Tutorials and online apps are available for reference mapping, cell annotation, and GRN inference.
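
The flash-attn pin is the usual stumbling block, so a short sanity check like the following (my own sketch, not from the README) can catch a broken environment before a long run:

    import torch

    # flash-attn needs a CUDA-capable GPU; confirm one is visible first.
    print("CUDA available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("CUDA version:", torch.version.cuda)    # README recommends 11.7
        print("GPU:", torch.cuda.get_device_name(0))

    # flash-attn is optional; without it scGPT can still load models on CPU.
    try:
        import flash_attn
        print("flash-attn:", flash_attn.__version__)  # README pins < 1.0.5
    except ImportError:
        print("flash-attn not installed; CPU loading still works.")

    import scgpt  # finally, confirm the package itself imports cleanly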

Highlighted Details

  • Pre-trained on over 33 million human cells (whole-human model).
  • Supports zero-shot cell embedding and reference mapping against millions of cells efficiently (e.g., an index over 33M cells occupies under 1 GB and answers searches in under 1 s on GPU; see the sketch after this list).
  • Online apps available for browser-based interaction.
  • Flash-attention is now an optional dependency, allowing CPU loading.
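
The 33M-cell figure implies a compressed approximate-nearest-neighbor index over cell embeddings rather than raw vectors. Below is a hypothetical sketch of that pattern using faiss; the library choice, index string, dimension, and file names are illustrative assumptions, not the project's exact configuration.

    import numpy as np
    import faiss  # illustrative library choice for compressed similarity search

    d = 512  # embedding dimension, assumed for illustration
    ref = np.load("ref_embeddings.npy").astype("float32")      # placeholder reference embeddings
    query = np.load("query_embeddings.npy").astype("float32")  # placeholder query embeddings

    # IVF + product quantization stores ~32 bytes per cell instead of 2 KB of
    # raw float32, which is how tens of millions of cells can fit in about 1 GB.
    index = faiss.index_factory(d, "IVF4096,PQ32")
    sample = ref[np.random.choice(len(ref), min(200_000, len(ref)), replace=False)]
    index.train(sample)  # learn coarse centroids and PQ codebooks
    index.add(ref)       # encode and add every reference cell

    # Reference mapping: nearest reference neighbors vote on each query cell's label.
    distances, neighbors = index.search(query, 10)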

Maintenance & Community

  • Active development with recent updates (Feb 2024) including preliminary HuggingFace integration.
  • Tutorials for zero-shot applications and continual pre-trained models are available.
  • Contributions are welcomed via pull requests.

Licensing & Compatibility

  • License details are not explicitly stated in the README.
  • Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

  • The README does not explicitly state the license, which is crucial for commercial adoption.
  • Flash-attention installation can be complex and requires specific hardware/software configurations.
  • Some features, like pretraining code with generative attention masking and HuggingFace integration, are still under development or in preliminary stages.

Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 3
  • Star history: 74 stars in the last 90 days

Explore Similar Projects

hyena-dna by HazyResearch (704 stars)
Genomic foundation model for long-range DNA sequence modeling
Created 2 years ago, updated 3 months ago
Starred by Jeff Hammerbacher (Cofounder of Cloudera), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 2 more.

open-r1 by huggingface (25k stars)
SDK for reproducing DeepSeek-R1
Created 6 months ago, updated 3 days ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Jeff Hammerbacher (Cofounder of Cloudera), and 10 more.