GenePT by yiqunchen

Single-cell foundation model leveraging ChatGPT embeddings for gene/cell biology

created 1 year ago
274 stars

Top 95.2% on sourcepulse

Project Summary

GenePT is a foundation model for single-cell biology that handles gene-level and cell-level tasks using ChatGPT embeddings of NCBI gene descriptions. It targets researchers and bioinformaticians working with single-cell RNA sequencing data who want to avoid the extensive data curation and computationally intensive training that traditional foundation models require.

How It Works

GenePT applies pre-trained OpenAI text embedding models (text-embedding-ada-002 and text-embedding-3-large) to the NCBI text description of each gene and uses the resulting vector as that gene's embedding. For cell-level analysis, it builds cell embeddings in one of two ways: by averaging gene embeddings weighted by expression level, or by embedding a sentence made of gene names ordered by expression. Neither route requires dataset curation or additional pre-training, which keeps the method efficient and accessible.
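The two cell-embedding routes described above can be sketched with NumPy. This is a minimal illustration, not the repository's code: the function names `weighted_cell_embeddings` and `cell_sentence` are hypothetical, and the choices to L2-normalize the averaged vector and to drop zero-expression genes from the sentence are assumptions made for the example.

```python
import numpy as np


def weighted_cell_embeddings(expr, gene_emb):
    """Expression-weighted average of gene embeddings, one row per cell.

    expr:     (n_cells, n_genes) non-negative expression matrix
    gene_emb: (n_genes, dim) gene embedding matrix
    """
    expr = np.asarray(expr, dtype=float)
    gene_emb = np.asarray(gene_emb, dtype=float)

    # Turn each cell's expression values into weights that sum to 1.
    totals = expr.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # cells with no counts keep an all-zero embedding
    cell_emb = (expr / totals) @ gene_emb

    # Assumption: scale each cell embedding to unit L2 norm.
    norms = np.linalg.norm(cell_emb, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return cell_emb / norms


def cell_sentence(expr_row, gene_names):
    """Gene names ordered by decreasing expression, joined with spaces.

    Assumption: genes with zero expression are left out of the sentence.
    """
    expr_row = np.asarray(expr_row, dtype=float)
    order = np.argsort(-expr_row, kind="stable")  # ties keep original gene order
    return " ".join(gene_names[i] for i in order if expr_row[i] > 0)
```

The string returned by `cell_sentence` would then be passed to a text embedding model to obtain the sentence-based cell embedding.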

Quick Start & Requirements

  • Install/Run: Analysis scripts are provided in the repository. Example notebooks demonstrate usage.
  • Prerequisites: Requires a valid OpenAI API key for embedding generation. Access to specific datasets (e.g., Geneformer, Gene2vec, scGPT-related datasets) may be needed for reproducing paper results.
  • Resources: Embedding generation incurs OpenAI API costs.
  • Links:
    • Analysis scripts: [repository link]
    • Pre-computed embeddings: [link to Zenodo]
    • Tutorials: [repository link]
    • Paper: [bioRxiv link]
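As a rough sketch of what embedding generation with an API key might look like, the example below batches gene descriptions through an embedding function. The helper names (`openai_embed_fn`, `embed_gene_descriptions`), the batch size, and the dictionary layout are assumptions for illustration and are not taken from the repository. The only OpenAI call used is `client.embeddings.create` from the official `openai` Python package, which reads the key from the `OPENAI_API_KEY` environment variable.

```python
from typing import Callable, Dict, List

EmbedFn = Callable[[List[str]], List[List[float]]]


def openai_embed_fn(model: str = "text-embedding-ada-002") -> EmbedFn:
    """Return an embedding function backed by the OpenAI API (costs apply)."""
    # Imported here so the batching helper below can be used without the package.
    from openai import OpenAI

    client = OpenAI()  # picks up OPENAI_API_KEY from the environment

    def embed(texts: List[str]) -> List[List[float]]:
        response = client.embeddings.create(model=model, input=texts)
        return [item.embedding for item in response.data]

    return embed


def embed_gene_descriptions(
    descriptions: Dict[str, str],
    embed_fn: EmbedFn,
    batch_size: int = 100,  # hypothetical default, tune to your rate limits
) -> Dict[str, List[float]]:
    """Map each gene symbol to the embedding of its text description."""
    genes = list(descriptions)
    embeddings: Dict[str, List[float]] = {}
    for start in range(0, len(genes), batch_size):
        batch = genes[start:start + batch_size]
        vectors = embed_fn([descriptions[gene] for gene in batch])
        embeddings.update(zip(batch, vectors))
    return embeddings
```

A call such as `embed_gene_descriptions(descriptions, openai_embed_fn())` would issue one API request per batch. Passing the embedding function as an argument also makes it possible to substitute a stub for offline testing, or to skip the API entirely when the pre-computed embeddings are sufficient.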

Highlighted Details

  • Achieves comparable or superior performance to existing single-cell foundation models on gene property classification and cell type annotation tasks.
  • Demonstrates an effective and straightforward method for building biological foundation models using LLM embeddings of literature.
  • Offers efficient cell embedding generation in under 20 lines of code.
  • Includes examples for batch effect removal while preserving biological information.
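To make the cell type annotation task concrete, here is a minimal sketch of one common way to annotate cells from embeddings: a cosine-similarity k-nearest-neighbor vote against labeled reference cells. It is an illustrative assumption, not a confirmed description of the repository's evaluation; the function name `knn_annotate` and the default `k=10` are hypothetical.

```python
from collections import Counter

import numpy as np


def knn_annotate(ref_emb, ref_labels, query_emb, k=10):
    """Label each query cell by majority vote among its k nearest reference cells."""
    ref = np.asarray(ref_emb, dtype=float)
    query = np.asarray(query_emb, dtype=float)

    # Normalize rows so the dot product equals cosine similarity.
    ref = ref / np.maximum(np.linalg.norm(ref, axis=1, keepdims=True), 1e-12)
    query = query / np.maximum(np.linalg.norm(query, axis=1, keepdims=True), 1e-12)

    sims = query @ ref.T                      # (n_query, n_ref) similarities
    k = min(k, ref.shape[0])                  # cannot use more neighbors than exist
    nearest = np.argsort(-sims, axis=1)[:, :k]

    # Majority vote; on ties, the label seen first among the neighbors wins.
    return [
        Counter(ref_labels[i] for i in row).most_common(1)[0][0]
        for row in nearest
    ]
```

Here `ref_emb` and `query_emb` would be cell embeddings for labeled and unlabeled cells respectively, and `ref_labels` the known cell types of the reference cells.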

Maintenance & Community

The project is associated with authors from academic institutions. Further community engagement details (e.g., Discord, Slack) are not explicitly mentioned in the README.

Licensing & Compatibility

The README does not explicitly state a license. The project's code and data usage should be reviewed for compatibility with commercial or closed-source applications.

Limitations & Caveats

The project relies on external OpenAI API services, which may incur costs and are subject to OpenAI's terms of service. Specific datasets used in the paper may require separate download and processing.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 18 stars in the last 90 days
