HyenaDNA is the official implementation of a long-range genomic foundation model that processes sequences of up to 1 million tokens at single-nucleotide resolution. It targets researchers who want to pretrain HyenaDNA models or fine-tune them on downstream genomic tasks over very long DNA sequences.
How It Works
HyenaDNA replaces the self-attention mechanism of standard transformers with a "Hyena" operator, which interleaves element-wise (data-controlled) gating with long convolutions whose filters are learned implicitly and evaluated efficiently via FFTs. Because this operator scales sub-quadratically with sequence length, it supports far longer context lengths and better computational efficiency than standard attention, making it practical to model very long stretches of genomic sequence.
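To illustrate the FFT-convolution idea, here is a minimal PyTorch sketch of a long convolution computed in frequency space. The `fft_long_conv` helper is hypothetical and only demonstrates the core trick; the real Hyena operator additionally parameterizes its filters implicitly (an MLP over positional features) and combines these convolutions with gating projections.

```python
import torch

def fft_long_conv(u: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """Causal long convolution of sequences u (batch, d, L) with a
    length-L filter k (d, L), computed in O(L log L) via the FFT.

    Illustrative sketch only; not the repository's implementation.
    """
    L = u.shape[-1]
    # Zero-pad to 2L so the circular FFT convolution matches a linear
    # (causal) convolution over the original length.
    u_f = torch.fft.rfft(u, n=2 * L)
    k_f = torch.fft.rfft(k, n=2 * L)
    y = torch.fft.irfft(u_f * k_f, n=2 * L)[..., :L]
    return y

# Toy usage: batch of 2 sequences, 16 channels, 1024 positions.
u = torch.randn(2, 16, 1024)
k = torch.randn(16, 1024) * 0.01   # stands in for a learned filter
y = fft_long_conv(u, k)
print(y.shape)  # torch.Size([2, 16, 1024])
```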
Quick Start & Requirements
- Installation: Clone the repository with submodules (`git clone --recurse-submodules`), create a conda environment (`conda create -n hyena-dna python=3.8`), install PyTorch 1.13 with CUDA 11.7, and then run `pip install -r requirements.txt`. Flash Attention must also be installed.
- Dependencies: Python 3.8+, PyTorch 1.13, CUDA 11.7, Flash Attention. Docker images are available for simplified setup.
- Resources: Pretrained models vary in size, with larger models requiring more GPU memory. Processing 1 million tokens may necessitate A100 GPUs.
- Entry Point: A Colab notebook offers an easy entry point, integrating with Hugging Face to load pretrained weights and fine-tune; a minimal loading sketch follows this list.
- Documentation: Links to arXiv, blog, Colab, Hugging Face, Discord, and YouTube talks are provided.
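For orientation, here is a minimal sketch of loading one of the Hugging-Face-compatible checkpoints with the transformers library. The checkpoint name is an assumption (substitute any published HyenaDNA "-hf" checkpoint from the hub), and `trust_remote_code=True` is assumed to be needed because the model code is custom.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed checkpoint name; pick any HyenaDNA "-hf" checkpoint on the hub.
checkpoint = "LongSafari/hyenadna-tiny-1k-seqlen-hf"

# HyenaDNA ships custom modeling code, hence trust_remote_code=True.
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=2, trust_remote_code=True
)

# Single-nucleotide tokenization: roughly one token per base.
sequence = "ACTG" * 256
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
print(logits.shape)  # (1, 2) -- one score per class
```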
Highlighted Details
- Supports context lengths up to 1 million tokens at single nucleotide resolution.
- Offers various pretrained model sizes on Hugging Face, trained on the human reference genome (hg38).
- Includes implementations for pretraining, fine-tuning on GenomicBenchmarks and Nucleotide Transformer datasets, and in-context learning (a toy fine-tuning sketch follows this list).
- Features an experimental bidirectional implementation and options for handling masked tokens.
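To make the fine-tuning workflow concrete, below is a toy sketch of fitting a small classification head on synthetic DNA sequences. The checkpoint name and the two-class toy data are assumptions standing in for a GenomicBenchmarks-style task, and the loss is computed manually from the logits rather than relying on the model class to return one; the repository's own training recipes use its config-driven pipeline instead.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed checkpoint; any small HyenaDNA "-hf" checkpoint should behave similarly.
checkpoint = "LongSafari/hyenadna-tiny-1k-seqlen-hf"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=2, trust_remote_code=True
)

# Synthetic (sequence, label) pairs standing in for a real downstream dataset.
pairs = [("ACGT" * 64, 0), ("TTAACCGG" * 32, 1)] * 8

def collate(batch):
    seqs, labels = zip(*batch)
    enc = tokenizer(list(seqs), return_tensors="pt")  # equal-length toy sequences
    return enc, torch.tensor(labels)

loader = DataLoader(pairs, batch_size=4, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

model.train()
for epoch in range(2):
    for enc, labels in loader:
        optimizer.zero_grad()
        logits = model(**enc).logits
        loss = F.cross_entropy(logits, labels)
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: last batch loss = {loss.item():.4f}")
```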
Maintenance & Community
- The project is active, with links to a Discord server for community interaction and idea sharing.
- Pretrained weights are available on Hugging Face.
Licensing & Compatibility
- The repository does not explicitly state a license in the README. Users should verify licensing for commercial or closed-source use.
Limitations & Caveats
- The repository is described as a "work in progress," suggesting potential for ongoing changes and instability.
- Setting up custom dataloaders and configurations can be time-consuming.
- The experimental bidirectional implementation only supports training from scratch on downstream tasks, not masked language model pretraining.