DNABERT_2 by MAGICS-LAB

Foundation model for multi-species genome analysis (ICLR 2024 paper)

created 2 years ago
404 stars

Top 72.9% on sourcepulse

Project Summary

DNABERT-2 provides an efficient foundation model and a comprehensive benchmark for multi-species genome understanding. It is designed for researchers and practitioners in bioinformatics and computational biology seeking state-of-the-art performance on diverse genomic tasks. The project offers a pre-trained model, a benchmark suite (GUE), and tools for fine-tuning on custom datasets.

How It Works

DNABERT-2 replaces the k-mer tokenization of the original DNABERT with Byte Pair Encoding (BPE) for a more compact and efficient sequence representation. It also swaps learned positional embeddings for Attention with Linear Biases (ALiBi), which lets the model generalize across varying sequence lengths. These architectural changes, combined with pre-training on large-scale multi-species genomic data, yield state-of-the-art performance on a wide range of genomic tasks.
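To make the ALiBi idea concrete, the following minimal sketch builds the bias matrix for a bidirectional encoder. It follows the original ALiBi formulation with symmetric distances; the function name, slope schedule, and shapes are illustrative, not the repository's implementation.

```python
import torch

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    """Symmetric ALiBi bias for a bidirectional encoder (illustrative sketch)."""
    # Each head h gets a fixed slope m_h (geometric schedule from the ALiBi
    # paper; assumes num_heads is a power of two). The penalty -m_h * |i - j|
    # is added to the raw attention scores, so relative position is encoded
    # with no learned embeddings and longer-than-training inputs still work.
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    pos = torch.arange(seq_len)
    distance = (pos[None, :] - pos[:, None]).abs()        # [seq_len, seq_len]
    return -slopes[:, None, None] * distance[None, :, :]  # [num_heads, seq_len, seq_len]

# Usage: scores = q @ k.transpose(-2, -1) / d**0.5 + alibi_bias(num_heads, seq_len)
```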

Quick Start & Requirements

  • Install: create a Python 3.8 virtual environment and run pip install -r requirements.txt. Triton can optionally be installed to enable Flash Attention.
  • Prerequisites: Python 3.8 and PyTorch; CUDA is recommended for GPU acceleration.
  • Usage: load the model and tokenizer directly from Hugging Face (zhihan1996/DNABERT-2-117M) with the transformers library, as in the sketch after this list.
  • Resources: the GUE benchmark dataset must be downloaded separately.
  • Links: Hugging Face Model Hub, GUE dataset
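The following minimal sketch shows the Hugging Face loading path described above. trust_remote_code=True is required because the checkpoint ships a custom architecture; the example DNA string and the mean-pooling step are illustrative.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
model = AutoModel.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)

dna = "ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC"  # example sequence
inputs = tokenizer(dna, return_tensors="pt")["input_ids"]
hidden_states = model(inputs)[0]                 # [1, num_tokens, 768]

# Mean-pool token embeddings into one sequence-level embedding.
embedding = torch.mean(hidden_states[0], dim=0)  # [768]
```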

Highlighted Details

  • Achieves state-of-the-art performance on the 28-dataset GUE benchmark, which spans 7 tasks and 4 species.
  • Uses BPE tokenization and ALiBi positional biases for improved efficiency and effectiveness.
  • Offers a companion model, DNABERT-S, for DNA embeddings in which sequences naturally cluster by species.
  • Provides scripts for evaluating on GUE and fine-tuning on custom datasets with DataParallel or DistributedDataParallel; a generic fine-tuning sketch follows this list.
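For orientation, here is a minimal fine-tuning sketch built on the generic Hugging Face Trainer rather than the repository's own scripts; the classification-head loading, output path, hyperparameters, and toy dataset are assumptions, not the project's recipe.

```python
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(
    "zhihan1996/DNABERT-2-117M", trust_remote_code=True, num_labels=2)

class DNADataset(torch.utils.data.Dataset):
    """Toy dataset of (sequence, label) pairs, tokenized up front."""
    def __init__(self, sequences, labels):
        self.enc = tokenizer(sequences, truncation=True, padding=True)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

train_ds = DNADataset(["ACGTACGTACGT", "TTTTAAAACCCC"], [0, 1])  # placeholder data

args = TrainingArguments(
    output_dir="dnabert2-finetune",  # hypothetical output path
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,   # tune with batch size; see Limitations & Caveats
    learning_rate=3e-5,
    num_train_epochs=3,
)
Trainer(model=model, args=args, train_dataset=train_ds).train()
```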

Maintenance & Community

The project accompanies an ICLR 2024 paper, and its primary author is Zhihan Zhou. Issues can be raised on the GitHub repository, and the authors can also be contacted directly by email.

Licensing & Compatibility

The repository does not explicitly state a license. The underlying models and code may be subject to the licenses of their respective dependencies (e.g., Hugging Face Transformers). Users should verify licensing for commercial or closed-source use.

Limitations & Caveats

The README does not specify a license, which may hinder commercial adoption. To replicate reported results, the fine-tuning script requires adjusting per-device batch size and gradient accumulation steps to the available GPU resources; the arithmetic is sketched below.
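A short sketch of that adjustment with purely illustrative numbers: what must match the original recipe is the effective batch size, i.e. the product of per-device batch size, gradient accumulation steps, and GPU count.

```python
# Illustrative numbers only; the target effective batch size comes from the
# training recipe being reproduced, not from this repository.
per_device_batch = 8          # largest size that fits in one GPU's memory
num_gpus = 2
target_effective_batch = 32   # hypothetical recipe value

# effective_batch = per_device_batch * grad_accum_steps * num_gpus
grad_accum_steps = target_effective_batch // (per_device_batch * num_gpus)
assert per_device_batch * grad_accum_steps * num_gpus == target_effective_batch
print(grad_accum_steps)  # -> 2
```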

Health Check

  • Last commit: 7 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 3
  • Star History: 33 stars in the last 90 days

Explore Similar Projects

Starred by Jeff Hammerbacher (Cofounder of Cloudera), Chip Huyen (Author of AI Engineering and Designing Machine Learning Systems), and 2 more.

hyena-dna by HazyResearch

Genomic foundation model for long-range DNA sequence modeling
704 stars · created 2 years ago · updated 3 months ago