bio_embeddings by sacdallago

Python library for generating protein embeddings from sequences

Created 6 years ago

505 stars

Top 61.7% on SourcePulse

1 Expert Loves This Project

Jeff Hammerbacher

Cofounder of Cloudera

Project Summary

This project provides a unified interface and reproducible workflows for generating protein embeddings from sequences with a range of deep learning models. It targets researchers and developers who use these embeddings for downstream tasks such as transfer learning, visualization, and property prediction. It simplifies model integration and abstracts away resource management.

How It Works

The library offers a pipeline that converts protein sequences into per-amino-acid or per-sequence embeddings. It supports a wide array of pre-trained models (e.g., SeqVec, ProtTrans, UniRep, ESM) and provides tools for dimensionality reduction (UMAP, t-SNE) and visualization of the resulting embeddings. The pipeline hides model-specific details and handles failures such as CUDA out-of-memory errors.
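The relationship between per-amino-acid and per-sequence embeddings can be illustrated with a minimal sketch. It assumes the reduction is a plain average over the residue axis; the library's own reduction may differ by model, and the function name and toy numbers below are hypothetical, not part of bio_embeddings.

```python
from typing import List


def reduce_per_sequence(per_residue: List[List[float]]) -> List[float]:
    """Average an L x D per-residue embedding into one D-dimensional vector.

    Assumption: mean pooling over residues. This is an illustration only,
    not the bio_embeddings implementation.
    """
    if not per_residue:
        raise ValueError("need at least one residue embedding")
    dim = len(per_residue[0])
    if any(len(row) != dim for row in per_residue):
        raise ValueError("all residue vectors must share one dimension")
    length = len(per_residue)
    # Average each embedding dimension across all residues.
    return [sum(row[i] for row in per_residue) / length for i in range(dim)]


# Toy input: a 3-residue sequence with 2-dimensional embeddings (made-up numbers).
toy = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled = reduce_per_sequence(toy)
print(pooled)
```

The pooled vector has the same dimensionality as each residue vector, which is what makes sequences of different lengths comparable in downstream tasks.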

Quick Start & Requirements

Install: pip install bio-embeddings[all], or use the Docker image ghcr.io/bioembeddings/bio_embeddings:v0.1.6.
Prerequisites: a Unix-like environment; a GPU with CUDA is recommended for performance. mmseqs2 is required for the mmseqs_search protocol.
Resources: Model weights are cached locally. GPU memory requirements vary by model.
Docs: docs.bioembeddings.com
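Before running a pipeline, it can help to confirm that the optional prerequisites above are present. The following is a rough sketch using only the standard library. It assumes the mmseqs2 executable is installed under the name mmseqs and that PyTorch is the GPU backend of interest; check_prerequisites is a hypothetical helper, not part of bio_embeddings.

```python
import importlib.util
import shutil
from typing import Dict


def check_prerequisites(binary: str = "mmseqs") -> Dict[str, bool]:
    """Report whether optional prerequisites appear to be available.

    Assumptions: the mmseqs2 executable is named `mmseqs`, and the
    `torch` package is what provides CUDA support.
    """
    return {
        # True if the executable can be found on PATH.
        "mmseqs_on_path": shutil.which(binary) is not None,
        # True if the package is importable; it is not actually imported here.
        "torch_installed": importlib.util.find_spec("torch") is not None,
    }


report = check_prerequisites()
for name, ok in report.items():
    print(f"{name}: {'ok' if ok else 'missing'}")
```

A missing mmseqs binary only matters for the mmseqs_search protocol; embedding protocols can still run without it.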

Highlighted Details

Supports over a dozen embedding models including ProtTrans (BERT, T5, XLNet), ESM, and SeqVec.
Includes a pipeline for sequence alignment and property extraction (secondary structure, localization).
Offers interactive 2D/3D visualization of embedding spaces.
Provides a distributed API via a webserver for scalable workflows.

Maintenance & Community

Key contributors include Christian Dallago, Konstantin Schütze, Tobias Olenyi, and Michael Heinzinger.
Community chat available at chat.bioembeddings.com.
Presented at ISMB 2020 and LMRL 2020; a talk is available on YouTube.

Licensing & Compatibility

The project itself appears to be under a permissive license, but specific model licenses are not detailed. Users should verify compatibility for commercial use.

Limitations & Caveats

Performance degrades significantly on systems without a GPU and CUDA. Windows users are advised to use WSL. The README notes potential inconsistencies on non-Unix setups.

Health Check

Last Commit

2 years ago

Responsiveness

Inactive

Star History

0 stars in the last 30 days