cell2sentence by vandijklab

LLM framework for single-cell transcriptomics

Created 1 year ago

812 stars

Top 43.5% on SourcePulse

Project Summary

Cell2Sentence (C2S) is a framework that applies Large Language Models (LLMs) to single-cell RNA sequencing (scRNA-seq) data. It addresses the challenge of analyzing complex transcriptomic data by transforming gene expression profiles into "cell sentences": ordered lists of gene names. Because each cell is represented as text, LLMs can process it like natural language, which enables downstream tasks such as perturbation prediction, dataset summarization, and biological question answering for researchers and bioinformaticians.

How It Works

The core innovation is the "cell sentence" transformation, which converts each cell's scRNA-seq expression vector into a space-separated list of gene names ordered by descending expression. This representation allows standard LLMs to model transcriptomic data directly. The C2S-Scale framework extends this with larger models (up to 27B parameters) based on architectures like Pythia and Gemma, and it processes transcriptomic and textual data together, so the models' natural language understanding can be applied to analyses of cell data.
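
The transformation described above can be illustrated with a minimal sketch. This is not the cell2sentence package's own API: the function name `cell_to_sentence`, the `top_k` parameter, the choice to drop genes with zero expression, and the toy gene counts are all assumptions made for illustration.

```python
def cell_to_sentence(expression, gene_names, top_k=None):
    """Rank genes by descending expression and join their names with spaces.

    Hypothetical helper for illustration; not part of the cell2sentence package.
    """
    if len(expression) != len(gene_names):
        raise ValueError("expression and gene_names must have the same length")

    # Assumption: genes with zero expression are left out of the sentence.
    expressed = [(gene, value) for gene, value in zip(gene_names, expression) if value > 0]

    # sorted() is stable, so genes with equal expression keep their input order.
    ranked = sorted(expressed, key=lambda pair: pair[1], reverse=True)

    # Assumption: optionally truncate to the top_k most expressed genes.
    if top_k is not None:
        ranked = ranked[:top_k]

    return " ".join(gene for gene, _ in ranked)


# Toy example with made-up counts for five genes in one cell.
genes = ["CD3E", "MS4A1", "LYZ", "GNLY", "NKG7"]
counts = [5, 0, 12, 3, 3]
print(cell_to_sentence(counts, genes))  # LYZ CD3E GNLY NKG7
```

The resulting string can be tokenized like any other text. A real pipeline would likely normalize the counts before ranking and would need an explicit policy for ties.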

Quick Start & Requirements

Installation involves cloning the repository, creating a Conda environment with Python 3.8 (conda create -n cell2sentence python=3.8), activating it (conda activate cell2sentence), and running make install. Alternatively, pip install cell2sentence==1.1.0 can be used. For accelerated inference, especially with long gene sequences, flash-attention can be optionally installed (pip install flash-attn --no-build-isolation). Official documentation and tutorials are available for various workflows.

Highlighted Details

Supports C2S-Scale models up to 27 billion parameters, based on Pythia and Gemma architectures.
Unifies transcriptomic and textual data for comprehensive analysis.
Enables advanced tasks: perturbation prediction, dataset summarization, cluster captioning, and biological question answering.
Pretrained models are available on Hugging Face; they were trained on large datasets including CellxGene and Human Cell Atlas (over 57 million cells).

Maintenance & Community

The project is associated with the van Dijk Lab. The README links to the lab's documentation and to blog posts on recent developments and collaborations, including one with Google Research. It does not list community channels (e.g., Discord, Slack) or detailed contributor information.

Licensing & Compatibility

The provided README does not explicitly state the software license. This omission requires further investigation for users considering commercial use or integration into closed-source projects.

Limitations & Caveats

The project is under active development, with planned features including support for legacy C2S-GPT-2 model prompts and parameter-efficient finetuning methods like LoRA. These functionalities are not yet implemented, indicating potential gaps for users requiring these specific capabilities.

Health Check

Last Commit

3 months ago

Responsiveness

Inactive

Star History

14 stars in the last 30 days