UniRep by churchlab

mLSTM for protein engineering informatics

created 6 years ago
355 stars

Project Summary

UniRep provides a deep representation learning model (mLSTM "babbler") for protein engineering informatics, enabling training, inference, and generative modeling of protein sequences. It targets researchers and practitioners in bioinformatics and computational biology, offering pre-trained models and tools for efficient protein sequence analysis and design.

How It Works

UniRep uses a multiplicative LSTM (mLSTM) architecture, an LSTM variant whose hidden-to-hidden transition depends on the current input, to learn representations from protein sequences. This lets it capture complex evolutionary and structural relationships within proteins. The model is trained on large protein datasets and produces embeddings that can be used for downstream tasks such as prediction and generation.
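To make the recurrence concrete, here is a minimal sketch of one multiplicative-LSTM step in plain Python, following the standard published mLSTM formulation with biases omitted. It is not UniRep's TensorFlow implementation: the parameter names, the one-hot input encoding, the random initialization, and the `embed` helper that averages hidden states into a fixed-length vector are all assumptions made for illustration.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues (illustrative alphabet)

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_params(n_in, n_hidden, seed=0):
    """Small random weights; names such as W_mx are hypothetical, not UniRep's."""
    rng = random.Random(seed)
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    p = {"W_mx": mat(n_hidden, n_in), "W_mh": mat(n_hidden, n_hidden)}
    for name in "ifoh":  # input gate, forget gate, output gate, candidate
        p["W_" + name + "x"] = mat(n_hidden, n_in)
        p["W_" + name + "m"] = mat(n_hidden, n_hidden)
    return p

def mlstm_step(x, h_prev, c_prev, p):
    """One mLSTM step: returns the new hidden state h and cell state c."""
    # Multiplicative intermediate state: an input-dependent transform of h_prev.
    m = [a * b for a, b in zip(matvec(p["W_mx"], x), matvec(p["W_mh"], h_prev))]

    def pre(name):
        # Each gate sees x and m, where a plain LSTM would use h_prev.
        return [a + b for a, b in zip(matvec(p["W_" + name + "x"], x),
                                      matvec(p["W_" + name + "m"], m))]

    i = [sigmoid(z) for z in pre("i")]
    f = [sigmoid(z) for z in pre("f")]
    o = [sigmoid(z) for z in pre("o")]
    g = [math.tanh(z) for z in pre("h")]
    c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c_prev, i, g)]
    h = [math.tanh(cj) * oj for cj, oj in zip(c, o)]
    return h, c

def one_hot(residue):
    return [1.0 if aa == residue else 0.0 for aa in AMINO_ACIDS]

def embed(seq, p, n_hidden):
    """Average the hidden states over all positions (assumed pooling scheme)."""
    h, c = [0.0] * n_hidden, [0.0] * n_hidden
    total = [0.0] * n_hidden
    for residue in seq:
        h, c = mlstm_step(one_hot(residue), h, c, p)
        total = [t + hj for t, hj in zip(total, h)]
    return [t / len(seq) for t in total]

params = init_params(len(AMINO_ACIDS), n_hidden=4)
vector = embed("MKV", params, n_hidden=4)  # a 4-dimensional toy embedding
```

The hidden size of 4 keeps the example readable; the released models use 64, 256, or 1900 units.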

Quick Start & Requirements

  • Install/Run: Docker is recommended. In the build commands, the trailing "." is the Docker build context, not sentence punctuation.
      Build CPU: docker build -f docker/Dockerfile.cpu -t unirep-cpu .
      Run CPU: docker/run_cpu_docker.sh
      Build GPU: docker build -f docker/Dockerfile.gpu -t unirep-gpu .
      Run GPU: docker/run_gpu_docker.sh
  • Prerequisites: GPU support requires NVIDIA CUDA 8.0, cuDNN 6.0.21, NVIDIA GPU Driver 410.79 (or compatible), and nvidia-docker. The AWS CLI is needed for direct weight downloads.
  • Resources: The 64-unit model runs on most machines. The full 1900-unit model requires >16GB GPU RAM. Training and fine-tuning are memory-intensive, and memory use is driven mainly by input sequence length.
  • Docs: unirep_tutorial.ipynb
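Because memory use scales with sequence length, one common way to keep training and fine-tuning within budget is to drop overly long sequences and batch the rest by similar length, so that little memory goes to padding. The following is a rough sketch of that idea, not UniRep's own data pipeline; the function names, the bucket width, the length cap, and the "-" pad character are all hypothetical.

```python
from collections import defaultdict

def bucket_by_length(seqs, bucket_width=50, max_len=500):
    """Group sequences into length buckets, skipping any longer than max_len."""
    buckets = defaultdict(list)
    for s in seqs:
        if len(s) > max_len:
            continue  # too long for the assumed memory budget
        buckets[len(s) // bucket_width].append(s)
    return dict(buckets)

def padded_batches(seqs, batch_size, bucket_width=50, max_len=500, pad="-"):
    """Yield batches of similar-length sequences, padded to the batch maximum."""
    buckets = bucket_by_length(seqs, bucket_width, max_len)
    for key in sorted(buckets):
        group = buckets[key]
        for start in range(0, len(group), batch_size):
            batch = group[start:start + batch_size]
            width = max(len(s) for s in batch)
            yield [s.ljust(width, pad) for s in batch]

sequences = ["MKV", "MKVLA", "M" * 60, "M" * 600]
for batch in padded_batches(sequences, batch_size=2):
    print([len(s) for s in batch])
```

Lowering `max_len` or `batch_size` is the simplest lever when a run exceeds available memory.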

Highlighted Details

  • Offers three model sizes (64, 256, 1900 units) with pre-trained and randomly initialized weights for reproducibility.
  • Includes tools for data management, training, inference, and generative modeling ("babbling").
  • Weights fine-tuned on fluorescent proteins are available.
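Generative "babbling" amounts to repeatedly sampling the next residue from a model's predicted distribution and appending it to the sequence. The sketch below shows that loop with temperature-scaled sampling. It assumes a hypothetical `next_logits(prefix)` callable standing in for a trained model; `toy_logits` is a made-up placeholder, and none of these names come from the UniRep code.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # illustrative 20-residue alphabet

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - top) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def babble(seed, length, next_logits, temperature=1.0, rng=None):
    """Extend a non-empty seed to `length` residues by sampling one at a time."""
    rng = rng or random.Random(0)
    seq = seed
    while len(seq) < length:
        probs = softmax(next_logits(seq), temperature)
        seq += rng.choices(AMINO_ACIDS, weights=probs, k=1)[0]
    return seq

def toy_logits(prefix):
    """Hypothetical stand-in for a trained model: favors repeating the last residue."""
    last = prefix[-1]
    return [2.0 if aa == last else 0.0 for aa in AMINO_ACIDS]

sample = babble("MKV", length=10, next_logits=toy_logits, temperature=0.5)
```

In practice `next_logits` would wrap a trained model's forward pass; the sampling loop itself stays the same.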

Maintenance & Community

  • Developed by churchlab.
  • No explicit community links (Discord/Slack) or roadmap are provided in the README.

Licensing & Compatibility

  • Model weights are licensed under Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
  • Code is licensed under GPL v3.
  • The non-commercial clause on model weights restricts use in proprietary or commercial applications.

Limitations & Caveats

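The project depends on TensorFlow 1.3, a release well behind current TensorFlow 2.x versions, so the provided Docker images are likely the most practical way to run it. The CC BY-NC 4.0 license on the model weights prohibits commercial use.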

Health Check

  • Last commit: 3 years ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 8 stars in the last 90 days
