bio_embeddings  by sacdallago

Python library for generating protein embeddings from sequences

created 6 years ago
495 stars

Top 63.4% on sourcepulse

Project Summary

This project provides a unified interface and reproducible workflows for generating protein embeddings from sequences with a range of deep learning models. It is aimed at researchers and developers who use these embeddings for downstream tasks such as transfer learning, visualization, and property prediction. It simplifies model integration and hides resource-management details such as GPU memory handling.

How It Works

The library offers a pipeline that converts protein sequences into per-amino-acid (one vector per residue) or per-sequence (one vector per protein) embeddings. It supports a wide array of pre-trained models (e.g., SeqVec, ProtTrans, UniRep, ESM) and provides tools for dimensionality reduction (UMAP, t-SNE) and for visualizing the resulting embeddings. The pipeline hides model-specific complexity and handles failures such as CUDA out-of-memory errors.
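
The difference between per-amino-acid and per-sequence embeddings can be illustrated with a toy sketch. It does not use bio_embeddings or any real model: `embed_per_residue` and `reduce_per_sequence` are hypothetical names, one-hot encoding stands in for a learned model, and mean pooling is assumed as the reduction step.

```python
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def embed_per_residue(sequence: str) -> List[List[float]]:
    """Toy stand-in for a model: one one-hot row per amino acid."""
    rows = []
    for residue in sequence:
        row = [0.0] * len(AMINO_ACIDS)
        index = AMINO_ACIDS.find(residue)
        if index >= 0:  # unknown residues (e.g. X) stay all-zero
            row[index] = 1.0
        rows.append(row)
    return rows


def reduce_per_sequence(per_residue: List[List[float]]) -> List[float]:
    """Mean-pool over the length axis to get one fixed-size vector."""
    if not per_residue:
        raise ValueError("cannot pool an empty sequence")
    length = len(per_residue)
    return [sum(column) / length for column in zip(*per_residue)]


matrix = embed_per_residue("MKTAYIAK")  # shape: sequence length x width
vector = reduce_per_sequence(matrix)    # shape: width only
print(len(matrix), len(matrix[0]), len(vector))
```

The per-residue matrix grows with sequence length, while the pooled vector has a fixed size, which is why per-sequence embeddings are the usual input for projection and visualization.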

Quick Start & Requirements

  • Install: pip install bio-embeddings[all] or via Docker (ghcr.io/bioembeddings/bio_embeddings:v0.1.6).
  • Prerequisites: A Unix-like environment; a GPU with CUDA is recommended for optimal performance. mmseqs2 is required for the mmseqs_search protocol.
  • Resources: Model weights are cached locally. GPU memory requirements vary by model.
  • Docs: docs.bioembeddings.com
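
Because GPU memory requirements vary by model and by sequence length, one common way to cope with out-of-memory conditions is to try a whole batch first and fall back to one sequence at a time. The sketch below shows that general pattern only; it is not taken from the bio_embeddings source. `embed_with_fallback` and `fake_embed_batch` are hypothetical, and `MemoryError` stands in for the `RuntimeError` that PyTorch raises on a CUDA out-of-memory condition.

```python
from typing import Callable, Dict, List, Optional

Embedding = List[float]


def embed_with_fallback(
    sequences: List[str],
    embed_batch: Callable[[List[str]], List[Embedding]],
) -> Dict[str, Optional[Embedding]]:
    """Try the whole batch first; on a memory error, retry one by one."""
    try:
        return dict(zip(sequences, embed_batch(sequences)))
    except MemoryError:
        pass  # fall through to per-sequence processing

    results: Dict[str, Optional[Embedding]] = {}
    for sequence in sequences:
        try:
            results[sequence] = embed_batch([sequence])[0]
        except MemoryError:
            results[sequence] = None  # too large even alone; record and move on
    return results


def fake_embed_batch(batch: List[str]) -> List[Embedding]:
    """Hypothetical embedder with a budget of 10 residues per call."""
    if sum(len(s) for s in batch) > 10:
        raise MemoryError("simulated out-of-memory")
    return [[float(len(s))] for s in batch]


results = embed_with_fallback(
    ["MKT", "AYIAKQR", "MKTAYIAKQRQISFVK"], fake_embed_batch
)
print(results)
```

Sequences that fail even on their own are recorded as `None` rather than aborting the run, so a single oversized protein does not discard the rest of the results.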

Highlighted Details

  • Supports over a dozen embedding models including ProtTrans (BERT, T5, XLNet), ESM, and SeqVec.
  • Includes a pipeline for sequence alignment and property extraction (secondary structure, localization).
  • Offers interactive 2D/3D visualization of embedding spaces.
  • Provides a distributed API via a webserver for scalable workflows.

Maintenance & Community

  • Key contributors include Christian Dallago, Konstantin Schütze, Tobias Olenyi, and Michael Heinzinger.
  • Community chat available at chat.bioembeddings.com.
  • Presentations at ISMB 2020 & LMRL 2020; YouTube talk available.

Licensing & Compatibility

  • The library itself appears to be permissively licensed, but the licenses of the individual models are not detailed. Users should verify both before commercial use.

Limitations & Caveats

  • Performance degrades significantly on systems without a GPU and CUDA. Windows users are advised to use WSL, and the README notes potential inconsistencies on non-Unix setups.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 9 stars in the last 90 days
