evo  by evo-design

DNA foundation model for long-context biological sequence modeling and design

Created 1 year ago
1,412 stars

Top 28.8% on SourcePulse

Project Summary

Evo is a biological foundation model for long-context sequence modeling and design, spanning scales from individual molecules to whole genomes. It targets researchers and developers in bioinformatics and synthetic biology who need to analyze or generate DNA sequences at context lengths of up to 131,072 tokens.

How It Works

Evo uses the StripedHyena architecture, which models DNA sequences at byte-level (single-nucleotide) resolution while compute and memory scale near-linearly with context length. This makes very long sequences practical to process, an advantage over standard transformer models, whose self-attention cost grows quadratically with sequence length. The model is trained on OpenGenome, a large dataset of prokaryotic whole genomes.
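To make the scaling claim concrete, here is a minimal sketch in plain Python. It is not Evo's tokenizer or a StripedHyena implementation. The helper names `encode_dna` and `cost_growth` are hypothetical, "byte-level" is assumed to mean one token per nucleotide character, and "near-linear" is idealized as exactly linear, ignoring constants and logarithmic factors.

```python
def encode_dna(seq: str) -> list[int]:
    """Illustrative byte-level encoding: one token per nucleotide.

    Assumption: each character maps to its ASCII byte value. This is a
    sketch of the idea, not the tokenizer shipped with Evo.
    """
    return list(seq.upper().encode("ascii"))


def cost_growth(n_short: int, n_long: int) -> tuple[float, float]:
    """Return (linear_growth, quadratic_growth) when context grows
    from n_short to n_long tokens.

    linear_growth idealizes a near-linear architecture;
    quadratic_growth models full pairwise self-attention.
    """
    ratio = n_long / n_short
    return ratio, ratio ** 2


if __name__ == "__main__":
    print(encode_dna("ACGT"))
    linear, quadratic = cost_growth(8_192, 131_072)
    print(f"linear-cost growth:    {linear:.0f}x")
    print(f"quadratic-cost growth: {quadratic:.0f}x")
```

Under these idealized assumptions, growing the context from 8,192 to 131,072 tokens multiplies a linear cost by 16 and a quadratic cost by 256. That gap is why long contexts are expensive for full self-attention.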

Quick Start & Requirements

  • Installation: pip install evo-model or from source (git clone then pip install .).
  • Prerequisites: PyTorch (the installed version must be compatible with FlashAttention-2), FlashAttention-2 (supported GPU architectures are listed in the FlashAttention GitHub repository), and optionally prodigal, which some scripts require.
  • Resources: Requires a CUDA-enabled GPU.
  • Documentation: https://github.com/evo-design/evo
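The prerequisites above can be checked before attempting to load a model. This is a sketch that uses only the Python standard library. The function name `check_environment` is hypothetical, and the import names `torch`, `flash_attn`, and `evo` are assumptions about how the packages are exposed once installed. The check reports only whether a package can be found, not whether its version or the GPU is compatible.

```python
import importlib.util
import shutil


def check_environment(
    modules=("torch", "flash_attn", "evo"),  # assumed import names
    tools=("prodigal",),                      # optional command-line tool
) -> dict[str, bool]:
    """Report which Python modules and CLI tools can be located.

    Nothing is imported or executed; the function only checks whether
    each module is discoverable and each tool is on PATH.
    """
    status = {}
    for name in modules:
        try:
            status[name] = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            status[name] = False
    for tool in tools:
        status[tool] = shutil.which(tool) is not None
    return status


if __name__ == "__main__":
    for name, found in check_environment().items():
        print(f"{name:12s} {'found' if found else 'MISSING'}")
```

Once PyTorch is installed, `torch.cuda.is_available()` gives a quick confirmation that a CUDA device is visible. Whether FlashAttention-2 supports that device still has to be checked against the FlashAttention documentation.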

Highlighted Details

  • 7 billion parameters, trained on ~300 billion tokens from OpenGenome.
  • Supports context lengths up to 131,072 tokens.
  • Offers various fine-tuned checkpoints for specific tasks (e.g., CRISPR-Cas systems, transposons).
  • Integrated with HuggingFace and available via Together AI API.
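The figures above translate into rough resource estimates. The sketch below is back-of-the-envelope arithmetic, not a measurement. It assumes weights are stored at 2 bytes per parameter (16-bit precision), ignores activations and any runtime overhead, and uses a hypothetical 4.6-million-base genome purely for illustration. The helper names are also hypothetical.

```python
import math

PARAMS = 7_000_000_000   # 7 billion parameters (from the summary above)
MAX_CONTEXT = 131_072    # maximum context length in tokens (2**17)


def weight_memory_gib(n_params: int, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone, in GiB.

    Assumption: 2 bytes per parameter (16-bit). Activations, caches,
    and framework overhead are not counted.
    """
    return n_params * bytes_per_param / 2**30


def windows_needed(genome_length: int, context: int = MAX_CONTEXT) -> int:
    """Number of non-overlapping context windows needed to cover a
    sequence, at one token per base."""
    return math.ceil(genome_length / context)


if __name__ == "__main__":
    print(f"weights at 16-bit: ~{weight_memory_gib(PARAMS):.1f} GiB")
    # Hypothetical genome length chosen for illustration only.
    print(f"windows for a 4.6 Mb genome: {windows_needed(4_600_000)}")
```

At one token per base, a single maximum-length context covers about 131 kilobases, so a genome of several megabases has to be processed in multiple windows.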

Maintenance & Community

The project is associated with the Arc Institute, and the work has been published in Science. The README notes a recent fix for an inference bug that affected specific release versions. Details on Evo 2 are available at https://github.com/arcinstitute/evo2.

Licensing & Compatibility

The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

FlashAttention-2, a key dependency, does not support every GPU architecture, so users should verify that their GPU is supported before installing. The project also points to a separate repository for Evo 2, which suggests that development is continuing there and that Evo 2 may differ from this release.

Health Check

  • Last commit: 1 day ago
  • Responsiveness: Inactive
  • Pull requests (30d): 1
  • Issues (30d): 0
  • Star history: 9 stars in the last 30 days

Explore Similar Projects

Starred by Jeff Hammerbacher (Cofounder of Cloudera), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 3 more.

hyena-dna by HazyResearch

0.3%
719
Genomic foundation model for long-range DNA sequence modeling
Created 2 years ago
Updated 4 months ago