celltypist by Teichlab

Cell type classifier for scRNA-seq datasets using logistic regression

Created 4 years ago

453 stars

Top 66.5% on SourcePulse


1 Expert Loves This Project

Jeff Hammerbacher

Cofounder of Cloudera

Project Summary

CellTypist is a Python package for semi-automatic cell type annotation of single-cell RNA sequencing (scRNA-seq) data. It uses logistic regression classifiers trained on reference datasets to predict cell identities in query datasets, and supports both single-label and multi-label classification. It is aimed at researchers and bioinformaticians who need to annotate cell populations in scRNA-seq data.

How It Works

CellTypist classifies cells with logistic regression models optimized by stochastic gradient descent. Users can apply pre-built models (e.g., for immune cell subtypes) or train custom models on their own reference data. The input is a gene expression count matrix (cell-by-gene or gene-by-cell) or an AnnData object; the output is a set of predicted cell type labels, decision scores, and probabilities. An optional majority voting step refines the per-cell predictions by assigning cells within a transcriptomically similar cluster the dominant label of that cluster.
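To make the idea of a logistic regression classifier fitted by stochastic gradient descent concrete, here is a minimal sketch in plain Python. It is not CellTypist's implementation: the toy two-gene data, the binary "T cell vs. other" setup, and the names `train_sgd` and `predict_proba` are all illustrative assumptions.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_sgd(X, y, lr=0.1, epochs=200, l2=0.0, seed=0):
    """Fit a binary logistic regression with one gradient step per sample."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit samples in a new random order each epoch
        for i in order:
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            err = p - y[i]  # gradient of the log loss with respect to the logit
            w = [wj - lr * (err * xj + l2 * wj) for wj, xj in zip(w, X[i])]
            b -= lr * err
    return w, b


def predict_proba(w, b, x):
    # Probability that the cell belongs to the positive class.
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Hypothetical toy data: two "genes" per cell; label 1 = T cell, 0 = other.
X = [[5.0, 0.1], [4.2, 0.3], [4.8, 0.0], [0.2, 3.9], [0.1, 4.5], [0.4, 5.1]]
y = [1, 1, 1, 0, 0, 0]

w, b = train_sgd(X, y)
probs = [predict_proba(w, b, x) for x in X]
```

A multi-class annotator can be assembled from one such binary model per cell type (one-vs-rest), which is also why the per-type probabilities for a cell need not sum to one.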

Quick Start & Requirements

Install: pip install celltypist or conda install -c bioconda -c conda-forge celltypist
Prerequisites: Python 3.x. Models are downloaded on demand and are typically ~1MB each.
Usage: The README provides detailed examples for both the Python API and the command-line interface. See the CellTypist website for interactive tutorials.
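A typical Python workflow might look like the sketch below. The wrapper `annotate_with_celltypist`, the input path, and the model name `Immune_All_Low.pkl` are assumptions made for illustration; check the README for the models and options that apply to your data. The import is placed inside the function so the snippet can be loaded even where CellTypist is not installed.

```python
def annotate_with_celltypist(input_path, model_name="Immune_All_Low.pkl"):
    """Annotate a count matrix file and return the per-cell label table.

    `input_path` is a hypothetical path to a cell-by-gene count matrix;
    `model_name` is assumed to be one of the pre-built models.
    """
    # Imported lazily: requires `pip install celltypist`.
    import celltypist
    from celltypist import models

    # Fetch the requested model if it is not already cached locally.
    models.download_models(model=model_name)

    # Predict cell types; majority voting refines labels within clusters.
    predictions = celltypist.annotate(
        input_path,
        model=model_name,
        majority_voting=True,
    )

    # Table with one row per cell, including the predicted labels.
    return predictions.predicted_labels
```

Calling `annotate_with_celltypist("my_counts.csv")` (a placeholder file name) would download the model on first use, so it needs network access the first time.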

Highlighted Details

Supports classification using pre-built or custom-trained models.
Offers two prediction modes: 'best match' for single-label classification and 'prob match' for multi-label classification.
Includes a majority voting classifier to leverage cell-cell transcriptomic similarity.
Provides functionality for creating custom models and cross-species/gene ID conversion.
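The two prediction modes and the majority voting step listed above can be sketched in a few lines of plain Python. This is an illustration of the ideas, not CellTypist's code: the function names are hypothetical, and the threshold `p_thres`, the proportion cutoff `min_prop`, the `|` separator, and the `Unassigned` and `Heterogeneous` placeholders are assumptions modeled on how the tool is generally described.

```python
from collections import Counter


def best_match(probs):
    # Single label: the cell type with the highest probability.
    return max(probs, key=probs.get)


def prob_match(probs, p_thres=0.5):
    # Multi-label: every cell type whose probability exceeds the threshold.
    hits = [label for label, p in probs.items() if p > p_thres]
    return "|".join(hits) if hits else "Unassigned"


def majority_vote(labels, clusters, min_prop=0.0):
    # Group per-cell labels by cluster, then give every cell in a cluster
    # the cluster's most common label if it is frequent enough.
    by_cluster = {}
    for label, cluster in zip(labels, clusters):
        by_cluster.setdefault(cluster, []).append(label)
    winners = {}
    for cluster, members in by_cluster.items():
        label, count = Counter(members).most_common(1)[0]
        dominant = count / len(members) >= min_prop
        winners[cluster] = label if dominant else "Heterogeneous"
    return [winners[c] for c in clusters]


# Hypothetical per-cell probabilities (one-vs-rest, so rows need not sum to 1).
cells = [
    {"T cell": 0.9, "B cell": 0.05, "NK cell": 0.6},
    {"T cell": 0.2, "B cell": 0.3, "NK cell": 0.1},
    {"T cell": 0.1, "B cell": 0.8, "NK cell": 0.05},
]
single = [best_match(c) for c in cells]
multi = [prob_match(c) for c in cells]

# Hypothetical per-cell labels and cluster assignments for five cells.
labels = ["T cell", "T cell", "B cell", "B cell", "B cell"]
clusters = [0, 0, 0, 1, 1]
voted = majority_vote(labels, clusters)
strict = majority_vote(labels, clusters, min_prop=0.7)
```

Here the third cell is relabeled from "B cell" to "T cell" by the vote because it sits in a cluster dominated by T cells; with the stricter `min_prop=0.7`, that same cluster is flagged as mixed instead.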

Maintenance & Community

The project is associated with Teichlab. Further community engagement details are not explicitly listed in the README.

Licensing & Compatibility

The README does not explicitly state a license. Compatibility with commercial or closed-source projects is not specified.

Limitations & Caveats

The README notes that, when a model restricted to a subset of cell types is needed, retraining on the original reference data is more accurate than using the subset method. Cross-species conversion relies on ortholog mapping files; the default mapping is based on Ensembl version 105. The tool is Python-based, and the README mentions no direct R compatibility.

Health Check

Last Commit: 1 day ago
Responsiveness: 1 week

Star History

3 stars in the last 30 days