GenePT by yiqunchen

Single-cell foundation model leveraging ChatGPT embeddings for gene/cell biology

Created 1 year ago
283 stars

Top 92.3% on SourcePulse

Project Summary

GenePT is a foundation model for single-cell biology, offering a user-friendly and efficient approach to gene-level and cell-level tasks by leveraging ChatGPT embeddings of NCBI gene descriptions. It is designed for researchers and bioinformaticians working with single-cell RNA sequencing data who seek to bypass extensive data curation and computationally intensive training of traditional foundation models.

How It Works

GenePT embeds the NCBI text description of each gene with OpenAI's pre-trained text embedding models (text-embedding-ada-002 and text-embedding-3-large); the resulting vector serves as that gene's embedding. For cell-level analysis, GenePT builds cell embeddings in one of two ways: by averaging gene embeddings weighted by each gene's expression level, or by embedding a sentence made of gene names ordered by expression. Because neither route requires dataset curation or additional pre-training, the approach is efficient and accessible.
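The two cell-embedding routes described above can be sketched with NumPy. This is a minimal illustration, not the repository's code: the function names, the toy 2-dimensional gene embeddings, and the toy expression matrix are all assumptions made for the example, and the optional L2 normalization is likewise an assumption rather than something this summary states.

```python
import numpy as np

def weighted_cell_embeddings(expr, gene_emb, normalize=True):
    """Expression-weighted average of gene embeddings, one row per cell.

    expr:     (n_cells, n_genes) non-negative expression matrix
    gene_emb: (n_genes, dim) gene embeddings, rows aligned with expr columns
    """
    expr = np.asarray(expr, dtype=float)
    gene_emb = np.asarray(gene_emb, dtype=float)
    totals = expr.sum(axis=1, keepdims=True)
    # Assumption: a cell with zero total expression maps to the zero vector.
    weights = np.divide(expr, totals, out=np.zeros_like(expr), where=totals > 0)
    cell_emb = weights @ gene_emb
    if normalize:
        # Assumption: rescale each cell embedding to unit L2 norm.
        norms = np.linalg.norm(cell_emb, axis=1, keepdims=True)
        cell_emb = np.divide(cell_emb, norms, out=np.zeros_like(cell_emb), where=norms > 0)
    return cell_emb

def expression_ordered_sentence(expr_row, gene_names):
    """Gene names joined in order of decreasing expression; unexpressed genes dropped."""
    expr_row = np.asarray(expr_row, dtype=float)
    order = np.argsort(-expr_row, kind="stable")
    return " ".join(gene_names[i] for i in order if expr_row[i] > 0)

# Toy data (illustrative values only, not real embeddings or counts).
gene_names = ["CD3E", "MS4A1", "LYZ"]
gene_emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
expr = np.array([[3.0, 0.0, 1.0], [0.0, 2.0, 0.0]])

cell_emb = weighted_cell_embeddings(expr, gene_emb, normalize=False)
unit_emb = weighted_cell_embeddings(expr, gene_emb)
sentences = [expression_ordered_sentence(row, gene_names) for row in expr]
```

In the second route, each sentence would then be passed to the text embedding model to obtain the cell embedding; that call is omitted here because it needs API access.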

Quick Start & Requirements

  • Install/Run: Analysis scripts are provided in the repository. Example notebooks demonstrate usage.
  • Prerequisites: Requires a valid OpenAI API key for embedding generation. Access to specific datasets (e.g., Geneformer, Gene2vec, scGPT-related datasets) may be needed for reproducing paper results.
  • Resources: Embedding generation incurs OpenAI API costs.
  • Links:
    • Analysis scripts: [repository link]
    • Pre-computed embeddings: [link to Zenodo]
    • Tutorials: [repository link]
    • Paper: [bioRxiv link]
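
Since embedding generation goes through the OpenAI API, a batched request loop might look like the sketch below. It assumes the official `openai` Python client, whose `client.embeddings.create(model=..., input=[...])` call returns one embedding per input text in `response.data`. The helper name `embed_gene_descriptions`, the batch size, the placeholder description strings, and the `OfflineStubClient` class are hypothetical and exist only so the example runs without a key or network access.

```python
from types import SimpleNamespace

def embed_gene_descriptions(descriptions, client, model="text-embedding-ada-002", batch_size=100):
    """Map gene name -> embedding vector for a dict of gene name -> description text."""
    names = list(descriptions)
    vectors = {}
    for start in range(0, len(names), batch_size):
        batch = names[start:start + batch_size]
        response = client.embeddings.create(
            model=model,
            input=[descriptions[name] for name in batch],
        )
        # Results come back in the same order as the inputs.
        for name, item in zip(batch, response.data):
            vectors[name] = item.embedding
    return vectors

class OfflineStubClient:
    """Hypothetical stand-in that mimics the response shape for dry runs (no API calls)."""
    def __init__(self, dim=4):
        self.embeddings = self  # so that client.embeddings.create(...) resolves
        self.dim = dim

    def create(self, model, input):
        data = [
            SimpleNamespace(index=i, embedding=[float(len(text))] * self.dim)
            for i, text in enumerate(input)
        ]
        return SimpleNamespace(data=data)

# Placeholder texts; real runs would use the NCBI gene descriptions.
descriptions = {
    "TP53": "placeholder description for TP53",
    "BRCA1": "placeholder description for BRCA1",
}
vectors = embed_gene_descriptions(descriptions, OfflineStubClient(), batch_size=1)

# With a real key, the stub would be replaced by something like:
#   from openai import OpenAI
#   client = OpenAI()  # reads OPENAI_API_KEY from the environment
```

Using the pre-computed embeddings linked above avoids these API calls, and their cost, entirely.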

Highlighted Details

  • Reports performance comparable to or better than existing single-cell foundation models on gene property classification and cell type annotation tasks.
  • Demonstrates an effective and straightforward method for building biological foundation models using LLM embeddings of literature.
  • Offers efficient cell embedding generation in under 20 lines of code.
  • Includes examples for batch effect removal while preserving biological information.
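
One common way to use fixed cell embeddings for cell type annotation is nearest-neighbor label transfer from a labeled reference set. The sketch below illustrates that general idea with cosine similarity and a majority vote; it is not taken from the repository, and the function name `knn_annotate`, the value of `k`, and the toy embeddings and labels are assumptions. It also assumes every embedding is non-zero.

```python
from collections import Counter

import numpy as np

def knn_annotate(query_emb, ref_emb, ref_labels, k=3):
    """Label each query cell by majority vote among its k most cosine-similar reference cells."""
    query_emb = np.asarray(query_emb, dtype=float)
    ref_emb = np.asarray(ref_emb, dtype=float)
    # Normalize rows so the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    r = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sims = q @ r.T
    top_k = np.argsort(-sims, axis=1)[:, :k]
    return [Counter(ref_labels[j] for j in row).most_common(1)[0][0] for row in top_k]

# Toy reference and query embeddings (illustrative values only).
ref_emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
ref_labels = ["T cell", "T cell", "B cell", "B cell"]
query_emb = [[0.8, 0.2], [0.2, 0.8]]

predicted = knn_annotate(query_emb, ref_emb, ref_labels, k=3)
```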

Maintenance & Community

The project is associated with authors from academic institutions. Further community engagement details (e.g., Discord, Slack) are not explicitly mentioned in the README.

Licensing & Compatibility

The README does not explicitly state a license. The project's code and data usage should be reviewed for compatibility with commercial or closed-source applications.

Limitations & Caveats

The project relies on external OpenAI API services, which may incur costs and are subject to OpenAI's terms of service. Specific datasets used in the paper may require separate download and processing.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 8 stars in the last 30 days
