HyenaDNA is the official implementation of a long-range genomic foundation model that processes sequences of up to 1 million tokens at single-nucleotide resolution. It targets researchers who want to pretrain HyenaDNA models or fine-tune them on downstream genomic tasks over very long DNA sequences.
How It Works
HyenaDNA replaces the self-attention mechanism of standard transformers with a "Hyena" operator, which interleaves element-wise (data-controlled) gating with long convolutions whose filters are learned implicitly and evaluated efficiently via FFTs. Because this operator scales sub-quadratically with sequence length, it supports far longer context lengths and better computational efficiency than standard attention, making it practical to model very long stretches of genomic sequence.
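To illustrate the FFT-convolution idea, here is a minimal PyTorch sketch of a long convolution computed in frequency space. The `fft_long_conv` helper is hypothetical and only demonstrates the core trick; the real Hyena operator additionally parameterizes its filters implicitly (an MLP over positional features) and combines these convolutions with gating projections.

```python
import torch

def fft_long_conv(u: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """Causal long convolution of sequences u (batch, d, L) with a
    length-L filter k (d, L), computed in O(L log L) via the FFT.

    Illustrative sketch only; not the repository's implementation.
    """
    L = u.shape[-1]
    # Zero-pad to 2L so the circular FFT convolution matches a linear
    # (causal) convolution over the original length.
    u_f = torch.fft.rfft(u, n=2 * L)
    k_f = torch.fft.rfft(k, n=2 * L)
    y = torch.fft.irfft(u_f * k_f, n=2 * L)[..., :L]
    return y

# Toy usage: batch of 2 sequences, 16 channels, 1024 positions.
u = torch.randn(2, 16, 1024)
k = torch.randn(16, 1024) * 0.01   # stands in for a learned filter
y = fft_long_conv(u, k)
print(y.shape)  # torch.Size([2, 16, 1024])
```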
Quick Start & Requirements
- Installation: Clone the repository with submodules (`git clone --recurse-submodules`), create a conda environment (`conda create -n hyena-dna python=3.8`), install PyTorch 1.13 with CUDA 11.7, and then run `pip install -r requirements.txt`. Flash Attention must also be installed.
- Dependencies: Python 3.8+, PyTorch 1.13, CUDA 11.7, Flash Attention. Docker images are available for simplified setup.
- Resources: Pretrained models vary in size, with larger models requiring more GPU memory. Processing 1 million tokens may necessitate A100 GPUs.
- Entry Point: A Colab notebook offers an easy entry point, integrating with Hugging Face to load pretrained weights and fine-tune; a minimal loading sketch follows this list.
- Documentation: Links to arXiv, blog, Colab, Hugging Face, Discord, and YouTube talks are provided.
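For orientation, here is a minimal sketch of loading one of the Hugging-Face-compatible checkpoints with the transformers library. The checkpoint name is an assumption (substitute any published HyenaDNA "-hf" checkpoint from the hub), and `trust_remote_code=True` is assumed to be needed because the model code is custom.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed checkpoint name; pick any HyenaDNA "-hf" checkpoint on the hub.
checkpoint = "LongSafari/hyenadna-tiny-1k-seqlen-hf"

# HyenaDNA ships custom modeling code, hence trust_remote_code=True.
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=2, trust_remote_code=True
)

# Single-nucleotide tokenization: roughly one token per base.
sequence = "ACTG" * 256
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
print(logits.shape)  # (1, 2) -- one score per class
```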
Highlighted Details
- Supports context lengths up to 1 million tokens at single nucleotide resolution.
- Offers various pretrained model sizes on Hugging Face, trained on the human reference genome (hg38).
- Includes implementations for pretraining, fine-tuning on GenomicBenchmarks and Nucleotide Transformer datasets, and in-context learning (a toy fine-tuning sketch follows this list).
- Features an experimental bidirectional implementation and options for handling masked tokens.
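To make the fine-tuning workflow concrete, below is a toy sketch of fitting a small classification head on synthetic DNA sequences. The checkpoint name and the two-class toy data are assumptions standing in for a GenomicBenchmarks-style task, and the loss is computed manually from the logits rather than relying on the model class to return one; the repository's own training recipes use its config-driven pipeline instead.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed checkpoint; any small HyenaDNA "-hf" checkpoint should behave similarly.
checkpoint = "LongSafari/hyenadna-tiny-1k-seqlen-hf"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=2, trust_remote_code=True
)

# Synthetic (sequence, label) pairs standing in for a real downstream dataset.
pairs = [("ACGT" * 64, 0), ("TTAACCGG" * 32, 1)] * 8

def collate(batch):
    seqs, labels = zip(*batch)
    enc = tokenizer(list(seqs), return_tensors="pt")  # equal-length toy sequences
    return enc, torch.tensor(labels)

loader = DataLoader(pairs, batch_size=4, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

model.train()
for epoch in range(2):
    for enc, labels in loader:
        optimizer.zero_grad()
        logits = model(**enc).logits
        loss = F.cross_entropy(logits, labels)
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: last batch loss = {loss.item():.4f}")
```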
Maintenance & Community
- The project is active, with links to a Discord server for community interaction and idea sharing.
- Pretrained weights are available on Hugging Face.
Licensing & Compatibility
- The repository does not explicitly state a license in the README. Users should verify licensing for commercial or closed-source use.
Limitations & Caveats
- The repository is described as a "work in progress," suggesting potential for ongoing changes and instability.
- Setting up custom dataloaders and configurations can be time-consuming.
- The experimental bidirectional implementation only supports training from scratch on downstream tasks, not masked language model pretraining.