cell2sentence by vandijklab

LLM framework for single-cell transcriptomics

Created 1 year ago
690 stars

Top 49.3% on SourcePulse

Project Summary

Cell2Sentence (C2S) is a framework that applies Large Language Models (LLMs) to single-cell RNA sequencing (scRNA-seq) data. It makes complex transcriptomic data tractable for language models by transforming each gene expression profile into a "cell sentence": an ordered list of gene names. Because the data is then plain text, LLMs can process it natively, which lets researchers and bioinformaticians use them for downstream tasks such as perturbation prediction, dataset summarization, and biological question answering.

How It Works

The core innovation is the "cell sentence" transformation, which converts scRNA-seq expression vectors into space-separated gene names ordered by descending expression. This representation allows standard LLMs to model transcriptomic data directly. The C2S-Scale framework extends this by incorporating larger models (up to 27B parameters) based on architectures like Pythia and Gemma, unifying transcriptomic and textual data processing. This enables more sophisticated analyses by leveraging the natural language understanding capabilities of LLMs.
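The transformation described above can be illustrated with a minimal sketch. This is not the library's own implementation: the function name cell_to_sentence, the top_k parameter, and the choice to drop genes with zero expression are all assumptions made for illustration, and the toy gene counts are invented.

```python
from typing import Optional, Sequence


def cell_to_sentence(
    expression: Sequence[float],
    gene_names: Sequence[str],
    top_k: Optional[int] = None,
) -> str:
    """Turn one cell's expression vector into a space-separated gene list.

    Hypothetical helper, not part of the cell2sentence package.
    """
    if len(expression) != len(gene_names):
        raise ValueError("expression and gene_names must have the same length")

    # Rank gene indices by descending expression. Python's sort is stable,
    # so genes with equal expression keep their original relative order.
    order = sorted(range(len(expression)), key=lambda i: -expression[i])

    # Assumption: genes with zero expression are left out of the sentence.
    order = [i for i in order if expression[i] > 0]

    # Optionally keep only the top_k most highly expressed genes.
    if top_k is not None:
        order = order[:top_k]

    return " ".join(gene_names[i] for i in order)


# Toy example with made-up counts for four genes in a single cell.
genes = ["CD3E", "MS4A1", "GNLY", "LYZ"]
counts = [5, 0, 12, 5]
print(cell_to_sentence(counts, genes))  # GNLY CD3E LYZ
```

The resulting string can be tokenized like any other text, which is what lets a standard LLM consume it without a custom input layer.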

Quick Start & Requirements

Installation involves cloning the repository, creating a Conda environment with Python 3.8 (conda create -n cell2sentence python=3.8), activating it (conda activate cell2sentence), and running make install. Alternatively, pip install cell2sentence==1.1.0 can be used. For accelerated inference, especially with long gene sequences, flash-attention can be optionally installed (pip install flash-attn --no-build-isolation). Official documentation and tutorials are available for various workflows.

Highlighted Details

  • Supports C2S-Scale models up to 27 billion parameters, based on Pythia and Gemma architectures.
  • Unifies transcriptomic and textual data for comprehensive analysis.
  • Enables advanced tasks: perturbation prediction, dataset summarization, cluster captioning, and biological question answering.
  • Pretrained models are available on Hugging Face, trained on large datasets including CellxGene and Human Cell Atlas (over 57 million cells).

Maintenance & Community

The project is associated with the van Dijk Lab. The README links to the lab's documentation and to blog posts covering recent developments and collaborations, including one with Google Research. It does not list community channels (e.g., Discord, Slack) or detailed contributor information.

Licensing & Compatibility

The provided README does not explicitly state the software license. Users considering commercial use or integration into closed-source projects should confirm the license terms before adopting the project.

Limitations & Caveats

The project is under active development. Planned features include support for legacy C2S-GPT-2 model prompts and parameter-efficient finetuning methods such as LoRA. Neither is implemented yet, so users who depend on these capabilities will need to wait or build them separately.

Health Check

  • Last commit: 3 days ago
  • Responsiveness: Inactive
  • Pull requests (30d): 1
  • Issues (30d): 4
  • Star history: 497 stars in the last 30 days
