tape by songlab-cal

Protein embedding benchmark for semi-supervised learning tasks

Created 6 years ago
727 stars

Top 47.5% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

TAPE (Tasks Assessing Protein Embeddings) provides a comprehensive benchmark suite for evaluating protein language models. It offers a pretraining corpus, five downstream tasks (secondary structure prediction, contact prediction, remote homology detection, fluorescence, and stability), pretrained model weights, and benchmarking code, targeting researchers and practitioners in bioinformatics and computational biology.

How It Works

TAPE is built on PyTorch and exposes a Hugging Face-style API (from_pretrained loading with automatic weight download and caching) for pretrained protein models, including a Transformer, UniRep, and a trRosetta re-implementation. It supports both unsupervised pretraining on large corpora such as Pfam and supervised fine-tuning on specific biological tasks, enabling robust evaluation of transfer learning in protein representation learning.
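A minimal loading sketch, assuming the ProteinBertModel and TAPETokenizer names shown in the project README (verify against the installed tape_proteins version):

```python
import torch
from tape import ProteinBertModel, TAPETokenizer

# Load a pretrained Transformer (BERT-style) protein model; weights are
# downloaded and cached automatically, in the Hugging Face style.
model = ProteinBertModel.from_pretrained('bert-base')
tokenizer = TAPETokenizer(vocab='iupac')  # use vocab='unirep' for the UniRep model

# Encode one protein sequence and run a forward pass.
sequence = 'GCTVEDRCLIGMGAILLNGCVIGSGSLVAAGALITQ'
token_ids = torch.tensor([tokenizer.encode(sequence)])
with torch.no_grad():
    sequence_output, pooled_output = model(token_ids)

print(sequence_output.shape)  # (1, sequence_length, hidden_size)
```

The sequence output gives per-residue embeddings; the pooled output is a single vector per sequence.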

Quick Start & Requirements

  • Install: pip install tape_proteins
  • Prerequisites: Python, PyTorch. GPU recommended for training.
  • Data: downloadable via the provided download_data.sh script (or point the code at existing data paths). The unsupervised Pfam dataset is ~19 GB uncompressed.
  • Docs: https://github.com/songlab-cal/tape

Highlighted Details

  • Provides a Hugging Face-style API for seamless model loading and weight caching.
  • Includes a tape-embed command for generating protein embeddings from FASTA files (an in-Python sketch of this workflow follows this list).
  • Supports distributed training with features like half-precision and gradient accumulation.
  • Offers a re-implementation of the trRosetta model with provided PyTorch code and data.
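As a hedged illustration of the embedding workflow behind tape-embed (not the CLI's exact behavior), the sketch below embeds each sequence in a FASTA file and mean-pools the per-residue vectors into one embedding per protein; the input file name and the pooling choice are illustrative:

```python
import torch
from tape import ProteinBertModel, TAPETokenizer

def read_fasta(path):
    """Minimal FASTA parser yielding (record_id, sequence) pairs."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith('>'):
                if header is not None:
                    yield header, ''.join(chunks)
                header, chunks = line[1:].split()[0], []  # first token of header as ID
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, ''.join(chunks)

model = ProteinBertModel.from_pretrained('bert-base').eval()
tokenizer = TAPETokenizer(vocab='iupac')

embeddings = {}
with torch.no_grad():
    for record_id, sequence in read_fasta('proteins.fasta'):  # hypothetical input file
        token_ids = torch.tensor([tokenizer.encode(sequence)])
        sequence_output, _ = model(token_ids)
        # Mean-pool over residue positions to get one fixed-size vector per protein.
        embeddings[record_id] = sequence_output.mean(dim=1).squeeze(0).numpy()
```

For large inputs, prefer the actual tape-embed command, which handles batching and writes its output to disk.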

Maintenance & Community

The project's README notes that direct training with TAPE's code is no longer recommended due to compatibility issues with newer PyTorch versions, suggesting the use of frameworks like PyTorch Lightning or Fairseq. The original TensorFlow repository is maintained separately.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. However, the project's components and datasets are intended for research use, with specific citation requirements for all data sources used.

Limitations & Caveats

The project explicitly states that this PyTorch version does not aim to exactly reproduce the original paper's results; the TensorFlow version is recommended for reproducibility. Training code is not maintained against newer PyTorch versions, and issues such as multi-GPU failures or out-of-memory (OOM) errors during training will not be fixed. Documentation is also noted as incomplete.

Health Check

  • Last Commit: 3 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 3 stars in the last 30 days

Explore Similar Projects

Starred by Théophile Gervet (Cofounder of Genesis AI), Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), and 7 more.

lingua by facebookresearch

0.0% · 5k stars
LLM research codebase for training and inference
Created 1 year ago · Updated 5 months ago
Starred by Jeff Hammerbacher (Cofounder of Cloudera), Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), and 25 more.

gpt-neox by EleutherAI

0.1% · 7k stars
Framework for training large-scale autoregressive language models
Created 5 years ago · Updated 1 month ago