evo by evo-design

DNA foundation model for long-context biological sequence modeling and design

created 1 year ago
1,393 stars

Top 29.6% on sourcepulse

Project Summary

Evo is a biological foundation model for long-context sequence modeling and design, spanning molecular to genome scales. It targets researchers and developers in bioinformatics and synthetic biology who need to analyze and generate DNA sequences at context lengths of up to 131,072 tokens.

How It Works

Evo uses the StripedHyena architecture, which models DNA at byte-level resolution and scales compute and memory near-linearly with context length. This makes very long sequences practical to process, unlike standard Transformer models, whose cost grows quadratically with sequence length. The model is trained on OpenGenome, a large prokaryotic whole-genome dataset.
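
As a rough illustration of the two ideas in this section, the sketch below maps a DNA string to byte-level tokens and compares how a quadratic cost and a near-linear cost grow with context length. The function names are hypothetical and are not part of the evo package, and the n·log2(n) cost is only an assumed stand-in for "near-linear", not a measured property of StripedHyena.

```python
import math


def byte_tokenize(sequence: str) -> list[int]:
    """Map each character to its byte value, giving one token per nucleotide."""
    return list(sequence.encode("ascii"))


def byte_detokenize(tokens: list[int]) -> str:
    """Invert byte_tokenize."""
    return bytes(tokens).decode("ascii")


def relative_cost(n: int, base: int = 8_192) -> dict[str, float]:
    """How much more work length n needs than length `base` under each cost model.

    "near_linear" uses n * log2(n) purely as an illustrative assumption.
    """
    return {
        "quadratic": (n / base) ** 2,
        "near_linear": (n * math.log2(n)) / (base * math.log2(base)),
    }


tokens = byte_tokenize("ACGT")
print(tokens)  # [65, 67, 71, 84]
print(relative_cost(131_072))
```

Going from 8,192 to 131,072 tokens is a 16x increase in length. Under the quadratic model that is 256x the work, while under the assumed near-linear model it stays close to 16x.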

Quick Start & Requirements

  • Installation: pip install evo-model, or install from source (git clone the repository, then pip install .).
  • Prerequisites: PyTorch (in a version compatible with FlashAttention-2), FlashAttention-2 (check the FlashAttention GitHub repository for supported GPU architectures), and optionally prodigal for specific scripts.
  • Resources: Requires a CUDA-enabled GPU.
  • Documentation: https://github.com/evo-design/evo
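
Before installing or running anything heavy, it can help to see which of the prerequisites above are already visible in the current environment. The following is a minimal sketch using only the standard library. The import names `torch`, `flash_attn` and `evo` are assumed to correspond to the PyTorch, FlashAttention-2 and evo-model packages, and `check_prerequisites` is a hypothetical helper, not something the project ships. It does not verify CUDA availability or GPU architecture support.

```python
import importlib.util
import shutil


def check_prerequisites() -> dict[str, bool]:
    """Report which prerequisites can be found, without importing them."""
    # Display name -> assumed import name of the installed package.
    modules = {"torch": "torch", "flash-attn": "flash_attn", "evo-model": "evo"}
    found = {
        name: importlib.util.find_spec(module) is not None
        for name, module in modules.items()
    }
    # prodigal is an optional command-line tool, so look it up on PATH.
    found["prodigal (optional)"] = shutil.which("prodigal") is not None
    return found


for name, ok in check_prerequisites().items():
    print(f"{name:20s} {'found' if ok else 'missing'}")
```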

Highlighted Details

  • 7 billion parameters, trained on ~300 billion tokens from OpenGenome.
  • Supports context lengths up to 131,072 tokens.
  • Offers various fine-tuned checkpoints for specific tasks (e.g., CRISPR-Cas systems, transposons).
  • Integrated with Hugging Face and available through the Together AI API.
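
Genomes are often longer than the 131,072-token limit listed above, so inputs may need to be cut into windows first. The sketch below shows one simple way to do that, assuming byte-level tokenization so that one character equals one token. `split_into_windows` and `MAX_CONTEXT` are hypothetical names for illustration and are not part of the evo package.

```python
MAX_CONTEXT = 131_072  # 2**17, the maximum context length listed above


def split_into_windows(
    sequence: str, window: int = MAX_CONTEXT, overlap: int = 0
) -> list[str]:
    """Cut a sequence into windows of at most `window` characters.

    Consecutive windows share `overlap` characters; the last one may be shorter.
    """
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    step = window - overlap
    windows = []
    for start in range(0, len(sequence), step):
        windows.append(sequence[start:start + window])
        if start + window >= len(sequence):
            break  # this window already reaches the end of the sequence
    return windows


genome = "ACGT" * 100_000  # 400,000 nucleotides of toy data
windows = split_into_windows(genome)
print(len(windows), [len(w) for w in windows])
# 4 [131072, 131072, 131072, 6784]
```

A non-zero overlap keeps some shared context between neighbouring windows, at the price of processing those positions twice.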

Maintenance & Community

The project is associated with the Arc Institute, and the work has been published in Science. The README mentions a recent inference bug fix that affects specific release versions. Evo 2 is maintained in a separate repository: https://github.com/arcinstitute/evo2.

Licensing & Compatibility

The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

FlashAttention-2, a key dependency, does not support every GPU architecture, so users must verify compatibility before installation. The project also points to a separate repository for Evo 2, which suggests that development continues there and that the two models may differ.

Health Check

  • Last commit: 5 months ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 2
  • Star History: 39 stars in the last 90 days
