esm by evolutionaryscale

Protein models & API for generative tasks and representation learning

Created 1 year ago
2,070 stars

Top 21.5% on SourcePulse

Project Summary

This repository provides access to EvolutionaryScale's flagship protein language models, ESM3 (generative) and ESM C (representation learning). It's designed for researchers and developers in bioinformatics and computational biology seeking advanced tools for protein sequence, structure, and function prediction and generation. The library offers a unified interface for local execution and cloud-based inference via the EvolutionaryScale Forge API and AWS SageMaker.

How It Works

ESM3 is a multimodal, generative masked language model that reasons across protein sequence, structure, and function. It is built on a scalable transformer backbone and generates iteratively: masked tokens are sampled and filled in over a series of steps. ESM C is a parallel model family focused on representation learning, designed as a drop-in replacement for ESM2 with significant performance and efficiency gains. Both models leverage discrete token representations for their respective tasks.
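The iterative generation described above can be illustrated with a toy sketch. This is not ESM3's implementation: `toy_predict` is a hypothetical stand-in that samples residues uniformly, the `_` mask character and the random unmasking order are assumptions made for illustration, and a real model would condition each prediction on the unmasked context.

```python
import random

MASK = "_"                          # assumed mask character for this sketch
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_predict(tokens, position, rng):
    """Hypothetical stand-in for a model forward pass.

    A real masked language model would use the unmasked context in
    `tokens`; this toy ignores it and samples a residue uniformly.
    """
    return rng.choice(AMINO_ACIDS)


def iterative_unmask(prompt, num_steps, seed=0):
    """Fill every masked position in `prompt` over `num_steps` rounds."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    rng = random.Random(seed)
    tokens = list(prompt)
    masked = [i for i, t in enumerate(tokens) if t == MASK]
    rng.shuffle(masked)  # toy choice: unmask in random order
    for step in range(num_steps):
        remaining_steps = num_steps - step
        # Ceiling division spreads the remaining positions evenly.
        n = -(-len(masked) // remaining_steps)
        chosen, masked = masked[:n], masked[n:]
        for i in chosen:
            tokens[i] = toy_predict(tokens, i, rng)
    return "".join(tokens)


if __name__ == "__main__":
    print(iterative_unmask("MK____LV__", num_steps=3))
```

Fewer steps fill more positions per round; more steps let later predictions see more of the already-sampled sequence, which is the trade-off an iterative masked sampler exposes.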

Quick Start & Requirements

  • Install via pip: pip install esm
  • Requires PyTorch and a CUDA-enabled GPU for local execution.
  • Hugging Face Hub login is required for model weight downloads.
  • Forge API access requires an API token.
  • SageMaker deployment involves AWS account setup and CloudFormation stack creation.
  • Local model instantiation downloads weights from Hugging Face Hub.
  • See ESM3 Quickstart and ESM C Quickstart for detailed examples.
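A minimal sketch of the local workflow in the steps above, assuming the import paths, the `esm3-open` model name, and the `GenerationConfig` arguments match the project's ESM3 Quickstart at the time of writing; verify them against the repository before relying on them. `mask_span` and `complete_with_esm3` are hypothetical helper names introduced here, and the `_` mask character is an assumption about the prompt format.

```python
MASK = "_"  # assumed mask character in sequence prompts


def mask_span(sequence: str, start: int, end: int) -> str:
    """Replace sequence[start:end] with mask characters."""
    if not 0 <= start <= end <= len(sequence):
        raise ValueError("span out of range")
    return sequence[:start] + MASK * (end - start) + sequence[end:]


def complete_with_esm3(prompt: str, device: str = "cuda") -> str:
    """Fill the masked positions of `prompt` with a locally loaded ESM3.

    Imports are deferred so that mask_span works without esm installed.
    """
    from huggingface_hub import login
    from esm.models.esm3 import ESM3
    from esm.sdk.api import ESMProtein, GenerationConfig

    login()  # asks for a Hugging Face token before weights are downloaded
    model = ESM3.from_pretrained("esm3-open").to(device)
    protein = ESMProtein(sequence=prompt)
    config = GenerationConfig(track="sequence", num_steps=8, temperature=0.7)
    result = model.generate(protein, config)
    return result.sequence


if __name__ == "__main__":
    prompt = mask_span("MKTAYIAK", 2, 5)
    print(prompt)
    # Needs `pip install esm`, downloaded weights, and a suitable device:
    # print(complete_with_esm3(prompt))
```

The generation call is left commented out because it downloads model weights and needs the hardware noted in the requirements.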

Highlighted Details

  • ESM3 98B was trained with 1.07e24 FLOPs.
  • ESM C 6B outperforms ESM2 15B.
  • Flash Attention support for ESM C via pip install flash-attn.
  • Forge Batch Executor for efficient concurrent processing.
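The idea behind concurrent batch processing can be sketched with the standard library. This is a generic fan-out pattern, not the Forge Batch Executor's API: `run_batch` and `fake_embed` are hypothetical names, and `fake_embed` stands in for one remote inference call.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_embed(sequence: str) -> int:
    """Hypothetical stand-in for a single remote inference request."""
    return len(sequence)


def run_batch(fn, inputs, max_workers=4):
    """Apply `fn` to each input concurrently, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the order of `inputs`,
        # regardless of which call finishes first.
        return list(pool.map(fn, inputs))


if __name__ == "__main__":
    print(run_batch(fake_embed, ["MK", "MKT", "MKTAYIAK"]))
```

Threads suit this kind of workload because each call mostly waits on the network; a production executor would typically also handle rate limits and retries.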

Maintenance & Community

  • Developed by EvolutionaryScale, a public benefit company.
  • Follows a Responsible Development Framework.
  • Citations provided for ESM3 and ESM C models.

Licensing & Compatibility

  • Code and weights are under a mixture of non-commercial and permissive commercial licenses. Refer to LICENSE.md for details.
  • SageMaker deployment is under the Cambrian Inference Clickthrough License Agreement, allowing commercial use.

Limitations & Caveats

  • Local execution requires significant computational resources, especially for larger models.
  • Forge and SageMaker deployments involve external service dependencies and potential costs.
  • Specific model versions and their availability may change.

Health Check

  • Last commit: 21 hours ago
  • Responsiveness: 1 week
  • Pull requests (30d): 7
  • Issues (30d): 1
  • Star history: 33 stars in the last 30 days
