tape by songlab-cal

Protein embedding benchmark for semi-supervised learning tasks

created 5 years ago
712 stars

Top 49.1% on sourcepulse

View on GitHub
1 Expert Loves This Project
Project Summary

TAPE (Tasks Assessing Protein Embeddings) provides a comprehensive benchmark suite for evaluating protein language models. It offers a pretraining corpus, five downstream tasks (secondary structure prediction, contact prediction, remote homology detection, fluorescence, and stability), pretrained model weights, and benchmarking code, targeting researchers and practitioners in bioinformatics and computational biology.

How It Works

TAPE is built on PyTorch and exposes a Hugging Face-style API for loading pretrained protein models, including Transformer, UniRep, and trRosetta. It supports both unsupervised pretraining on large corpora such as Pfam and supervised fine-tuning on specific biological tasks, enabling a systematic evaluation of transfer learning in protein representation learning.

Quick Start & Requirements

  • Install: pip install tape_proteins
  • Prerequisites: Python, PyTorch. GPU recommended for training.
  • Data: download with download_data.sh, or point the training scripts at your own data paths. The unsupervised Pfam dataset is ~19 GB uncompressed.
  • Docs: https://github.com/songlab-cal/tape

Highlighted Details

  • Exposes a Hugging Face-style API for seamless model loading and weight caching.
  • Includes a tape-embed command for generating protein embeddings from FASTA files.
  • Supports distributed training with features like half-precision and gradient accumulation.
  • Offers a re-implementation of the trRosetta model with provided PyTorch code and data.

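The `tape-embed` command writes its embeddings to a NumPy `.npz` archive keyed by FASTA record ID. Below is a minimal sketch of consuming such a file; the archive is fabricated here so the snippet is self-contained, and the per-record layout (one plain 2-D array per ID) and shapes are illustrative assumptions — check the README for the exact output format:

```python
import numpy as np

# Fabricate a stand-in for tape-embed output: one assumed
# (sequence_length, hidden_dim) array per FASTA record ID.
np.savez('my_output.npz',
         seq1=np.random.randn(50, 768),
         seq2=np.random.randn(34, 768))

# Read it back the way you would a real tape-embed result.
with np.load('my_output.npz', allow_pickle=True) as arrays:
    embeddings = {seq_id: arrays[seq_id] for seq_id in arrays.files}

for seq_id, emb in embeddings.items():
    print(seq_id, emb.shape)
```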
Maintenance & Community

The project's README notes that direct training with TAPE's code is no longer recommended due to compatibility issues with newer PyTorch versions, suggesting the use of frameworks like PyTorch Lightning or Fairseq. The original TensorFlow repository is maintained separately.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. However, the project's components and datasets are intended for research use, with specific citation requirements for all data sources used.

Limitations & Caveats

The project explicitly states that this PyTorch version does not aim for exact numerical compatibility with the original paper's results; the TensorFlow version is recommended for reproducing them. The training code is not maintained against newer PyTorch releases, issues such as multi-GPU failures or out-of-memory (OOM) errors during training will not be fixed, and the documentation is noted as incomplete.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 20 stars in the last 90 days

Explore Similar Projects

Starred by Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake).

fms-fsdp by foundation-model-stack

0.4%
258
Efficiently train foundation models with PyTorch
created 1 year ago
updated 1 week ago
Starred by Elie Bursztein (Cybersecurity Lead at Google DeepMind), Lysandre Debut (Chief Open-Source Officer at Hugging Face), and 5 more.

gpt-neo by EleutherAI

0.0%
8k
GPT-2/3-style model implementation using mesh-tensorflow
created 5 years ago
updated 3 years ago