vandijklab: LLM framework for single-cell transcriptomics
Cell2Sentence (C2S) is a framework that applies Large Language Models (LLMs) to single-cell RNA sequencing (scRNA-seq) data. It addresses the challenge of analyzing complex transcriptomic data by transforming gene expression profiles into "cell sentences": ordered lists of gene names. This lets LLMs process biological data as natural language, supporting downstream tasks such as perturbation prediction, dataset summarization, and biological question answering. The framework is aimed at researchers and bioinformaticians.
How It Works
The core innovation is the "cell sentence" transformation, which converts scRNA-seq expression vectors into space-separated gene names ordered by descending expression. This representation allows standard LLMs to model transcriptomic data directly. The C2S-Scale framework extends this by incorporating larger models (up to 27B parameters) based on architectures like Pythia and Gemma, unifying transcriptomic and textual data processing. This enables more sophisticated analyses by leveraging the natural language understanding capabilities of LLMs.
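The transformation described above can be illustrated with a minimal sketch. This is not the `cell2sentence` package API: the function name `cell_to_sentence`, the `top_k` parameter, and the choice to drop zero-expression genes are assumptions made for illustration, and the real pipeline may normalize or filter the data differently.

```python
import numpy as np


def cell_to_sentence(expression, gene_names, top_k=None):
    """Hypothetical helper: turn one cell's expression vector into a "cell sentence".

    Genes are ordered by descending expression and joined with spaces.
    Dropping zero-expression genes and the optional top_k cut-off are
    assumptions of this sketch, not confirmed behaviour of the C2S library.
    """
    expression = np.asarray(expression, dtype=float)
    if expression.shape[0] != len(gene_names):
        raise ValueError("expression and gene_names must have the same length")

    # Stable sort on the negated values gives descending expression,
    # with ties kept in their original gene order.
    order = np.argsort(-expression, kind="stable")

    # Keep only genes that are actually expressed in this cell (assumption).
    order = [i for i in order if expression[i] > 0]

    if top_k is not None:
        order = order[:top_k]

    return " ".join(gene_names[i] for i in order)


# Toy example with made-up counts for five marker genes.
genes = ["CD3E", "MS4A1", "LYZ", "NKG7", "GNLY"]
counts = [5, 0, 12, 3, 0]
print(cell_to_sentence(counts, genes))  # prints "LYZ CD3E NKG7"
```

The resulting string can be tokenized like any other text, which is what allows a standard LLM to consume it without a custom input layer.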
Quick Start & Requirements
Installation involves cloning the repository, creating a Conda environment with Python 3.8 (conda create -n cell2sentence python=3.8), activating it (conda activate cell2sentence), and running make install. Alternatively, pip install cell2sentence==1.1.0 can be used. For accelerated inference, especially with long gene sequences, flash-attention can be optionally installed (pip install flash-attn --no-build-isolation). Official documentation and tutorials are available for various workflows.
Highlighted Details
Maintenance & Community
The project is associated with the van Dijk Lab, with links provided to their documentation and blog posts detailing recent developments and collaborations, including with Google Research. Specific community channels (e.g., Discord, Slack) or detailed contributor information are not explicitly listed in the README.
Licensing & Compatibility
The provided README does not explicitly state the software license. Users considering commercial use or integration into closed-source projects should confirm the license terms before adopting it.
Limitations & Caveats
The project is under active development. Planned features include support for legacy C2S-GPT-2 model prompts and parameter-efficient finetuning methods such as LoRA. These features are not yet implemented, so users who depend on them will need to wait or find a workaround.