esm by evolutionaryscale

Protein models & API for generative tasks and representation learning

Created 1 year ago

2,198 stars

Top 20.3% on SourcePulse

Project Summary

This repository provides access to EvolutionaryScale's flagship protein language models, ESM3 (generative) and ESM C (representation learning). It's designed for researchers and developers in bioinformatics and computational biology seeking advanced tools for protein sequence, structure, and function prediction and generation. The library offers a unified interface for local execution and cloud-based inference via the EvolutionaryScale Forge API and AWS SageMaker.

How It Works

ESM3 is a multimodal, generative masked language model that reasons across protein sequence, structure, and function. It uses a scalable transformer backbone and generates iteratively: masked tokens are sampled and filled in over a number of steps. ESM C is a companion model family focused on representation learning; it is designed as a drop-in replacement for ESM2 and offers significant performance and efficiency gains over it. Both models leverage discrete token representations for their respective tasks.
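
To make the idea of iterative generation by sampling masked tokens concrete, here is a toy sketch in plain Python. It is not ESM3's implementation: `toy_predict` is a hypothetical stand-in that returns a random residue and a random confidence where a real model would run a transformer forward pass, and the "commit the most confident proposals first" schedule is one common choice assumed for illustration.

```python
import random

MASK = "_"
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_predict(tokens, position, rng):
    """Hypothetical stand-in for a model forward pass: (token, confidence)."""
    return rng.choice(AMINO_ACIDS), rng.random()


def iterative_unmask(prompt, num_steps, seed=0):
    """Fill every MASK position in `prompt` over at most `num_steps` rounds."""
    rng = random.Random(seed)
    tokens = list(prompt)
    masked = [i for i, t in enumerate(tokens) if t == MASK]
    for step in range(num_steps):
        if not masked:
            break
        # Spread the remaining masked positions evenly over the remaining steps.
        remaining_steps = num_steps - step
        k = -(-len(masked) // remaining_steps)  # ceiling division
        # Propose a token for every masked position, then commit the k most
        # confident proposals; the rest stay masked for the next round.
        proposals = {i: toy_predict(tokens, i, rng) for i in masked}
        chosen = sorted(masked, key=lambda i: proposals[i][1], reverse=True)[:k]
        for i in chosen:
            tokens[i] = proposals[i][0]
        masked = [i for i in masked if i not in chosen]
    return "".join(tokens)


result = iterative_unmask("MK____LV__", num_steps=3)
print(result)
```

Unmasked residues in the prompt are left untouched, so the fixed positions act as conditioning for the positions being generated.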

Quick Start & Requirements

Install via pip: pip install esm
Local execution requires PyTorch; a CUDA-enabled GPU is recommended for practical speed.
Hugging Face Hub login is required for model weight downloads.
Forge API access requires an API token.
SageMaker deployment involves AWS account setup and CloudFormation stack creation.
Instantiating a model locally downloads its weights from the Hugging Face Hub.
See ESM3 Quickstart and ESM C Quickstart for detailed examples.
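
The local workflow listed above can be sketched as follows. This is a minimal sketch, not a definitive recipe: the `esm` imports, the `esm3-open` checkpoint name, and the `_` mask character follow the upstream README and should be checked against the current release. `build_masked_prompt` and `generate_sequence` are hypothetical helper names introduced here for illustration.

```python
def build_masked_prompt(sequence: str, start: int, end: int, mask: str = "_") -> str:
    """Replace sequence[start:end] with mask characters for the model to fill in."""
    if not 0 <= start <= end <= len(sequence):
        raise ValueError("start/end must satisfy 0 <= start <= end <= len(sequence)")
    return sequence[:start] + mask * (end - start) + sequence[end:]


def generate_sequence(prompt: str, device: str = "cuda", num_steps: int = 8):
    """Fill masked positions with ESM3 (needs `pip install esm` and weight access)."""
    # Imported lazily so the helper above works without the esm package installed.
    from esm.models.esm3 import ESM3
    from esm.sdk.api import ESMProtein, GenerationConfig

    # Downloads the weights from the Hugging Face Hub on first use.
    model = ESM3.from_pretrained("esm3-open").to(device)
    protein = ESMProtein(sequence=prompt)
    config = GenerationConfig(track="sequence", num_steps=num_steps, temperature=0.7)
    return model.generate(protein, config)


prompt = build_masked_prompt("MKTAYIAK", 2, 5)
print(prompt)
```

Calling `generate_sequence(prompt)` is left out of the snippet because it needs the model weights; log in first with `huggingface_hub.login()` so the download is authorized.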

Highlighted Details

The largest ESM3 model (98B parameters) was trained with 1.07e24 FLOPs.
ESM C 6B outperforms ESM2 15B.
Flash Attention support for ESM C via pip install flash-attn.
Forge Batch Executor for efficient concurrent processing.
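
The benefit of concurrent batch processing can be illustrated with the standard library alone. The sketch below does not use the Forge Batch Executor or any `esm` API: `fake_remote_call` is a hypothetical stand-in for one network request, and `run_batch` is an illustrative name.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_remote_call(sequence: str) -> int:
    """Hypothetical stand-in for one inference request; returns the sequence length."""
    return len(sequence)


def run_batch(sequences, worker=fake_remote_call, max_workers=4):
    """Apply `worker` to every sequence concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the order of the inputs, not completion order.
        return list(pool.map(worker, sequences))


print(run_batch(["MK", "ACDE", ""]))
```

Threads suit this kind of workload because each request spends most of its time waiting on the network rather than computing.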

Maintenance & Community

Developed by EvolutionaryScale, a public benefit company.
Follows a Responsible Development Framework.
Citations provided for ESM3 and ESM C models.

Licensing & Compatibility

Code and weights are under a mixture of non-commercial and permissive commercial licenses. Refer to LICENSE.md for details.
SageMaker deployment is under the Cambrian Inference Clickthrough License Agreement, allowing commercial use.

Limitations & Caveats

Local execution requires significant computational resources, especially for larger models.
Forge and SageMaker deployments involve external service dependencies and potential costs.
Specific model versions and their availability may change.

Health Check

Last Commit: 1 month ago
Responsiveness: 1 week
Pull Requests (30d): 1
Issues (30d): 1

Star History

29 stars in the last 30 days
