DNABERT_2 by MAGICS-LAB

Foundation model for multi-species genome analysis (ICLR 2024 paper)

created 2 years ago
404 stars

Top 72.9% on sourcepulse

Project Summary

DNABERT-2 provides an efficient foundation model and a comprehensive benchmark for multi-species genome understanding. It is designed for researchers and practitioners in bioinformatics and computational biology seeking state-of-the-art performance on diverse genomic tasks. The project offers a pre-trained model, a benchmark suite (GUE), and tools for fine-tuning on custom datasets.

How It Works

DNABERT-2 replaces the k-mer tokenization of the original DNABERT with Byte Pair Encoding (BPE) for a more compact and efficient sequence representation. It also swaps learned positional embeddings for Attention with Linear Biases (ALiBi), which lets the model generalize across varying sequence lengths. These architectural changes, combined with pre-training on large-scale multi-species genomic data, yield state-of-the-art performance on a wide range of genomic tasks.
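To make the ALiBi idea concrete, the following minimal sketch builds the bias matrix for a bidirectional encoder. It follows the original ALiBi formulation with symmetric distances; the function name, slope schedule, and shapes are illustrative, not the repository's implementation.

```python
import torch

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    """Symmetric ALiBi bias for a bidirectional encoder (illustrative sketch)."""
    # Each head h gets a fixed slope m_h (geometric schedule from the ALiBi
    # paper; assumes num_heads is a power of two). The penalty -m_h * |i - j|
    # is added to the raw attention scores, so relative position is encoded
    # with no learned embeddings and longer-than-training inputs still work.
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    pos = torch.arange(seq_len)
    distance = (pos[None, :] - pos[:, None]).abs()        # [seq_len, seq_len]
    return -slopes[:, None, None] * distance[None, :, :]  # [num_heads, seq_len, seq_len]

# Usage: scores = q @ k.transpose(-2, -1) / d**0.5 + alibi_bias(num_heads, seq_len)
```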

Quick Start & Requirements

  • Install: create a Python 3.8 virtual environment and run pip install -r requirements.txt. Triton can optionally be installed to enable Flash Attention.
  • Prerequisites: Python 3.8 and PyTorch; CUDA is recommended for GPU acceleration.
  • Usage: load the model and tokenizer directly from Hugging Face (zhihan1996/DNABERT-2-117M) with the transformers library, as in the sketch after this list.
  • Resources: the GUE benchmark dataset must be downloaded separately.
  • Links: Hugging Face Model Hub, GUE dataset
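The following minimal sketch shows the Hugging Face loading path described above. trust_remote_code=True is required because the checkpoint ships a custom architecture; the example DNA string and the mean-pooling step are illustrative.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
model = AutoModel.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)

dna = "ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC"  # example sequence
inputs = tokenizer(dna, return_tensors="pt")["input_ids"]
hidden_states = model(inputs)[0]                 # [1, num_tokens, 768]

# Mean-pool token embeddings into one sequence-level embedding.
embedding = torch.mean(hidden_states[0], dim=0)  # [768]
```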

Highlighted Details

  • Achieves state-of-the-art performance on the 28-dataset GUE benchmark, which spans 7 tasks and 4 species.
  • Uses BPE tokenization and ALiBi positional biases for improved efficiency and effectiveness.
  • Offers a companion model, DNABERT-S, for DNA embeddings in which sequences naturally cluster by species.
  • Provides scripts for evaluating on GUE and fine-tuning on custom datasets with DataParallel or DistributedDataParallel; a generic fine-tuning sketch follows this list.
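For orientation, here is a minimal fine-tuning sketch built on the generic Hugging Face Trainer rather than the repository's own scripts; the classification-head loading, output path, hyperparameters, and toy dataset are assumptions, not the project's recipe.

```python
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(
    "zhihan1996/DNABERT-2-117M", trust_remote_code=True, num_labels=2)

class DNADataset(torch.utils.data.Dataset):
    """Toy dataset of (sequence, label) pairs, tokenized up front."""
    def __init__(self, sequences, labels):
        self.enc = tokenizer(sequences, truncation=True, padding=True)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

train_ds = DNADataset(["ACGTACGTACGT", "TTTTAAAACCCC"], [0, 1])  # placeholder data

args = TrainingArguments(
    output_dir="dnabert2-finetune",  # hypothetical output path
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,   # tune with batch size; see Limitations & Caveats
    learning_rate=3e-5,
    num_train_epochs=3,
)
Trainer(model=model, args=args, train_dataset=train_ds).train()
```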

Maintenance & Community

The project accompanies an ICLR 2024 paper, and its primary author is Zhihan Zhou. Issues can be raised on the GitHub repository, and the authors can also be contacted directly by email.

Licensing & Compatibility

The repository does not explicitly state a license. The underlying models and code may be subject to the licenses of their respective dependencies (e.g., Hugging Face Transformers). Users should verify licensing for commercial or closed-source use.

Limitations & Caveats

The README does not specify a license, which may hinder commercial adoption. To replicate reported results, the fine-tuning script requires adjusting per-device batch size and gradient accumulation steps to the available GPU resources; the arithmetic is sketched below.
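A short sketch of that adjustment with purely illustrative numbers: what must match the original recipe is the effective batch size, i.e. the product of per-device batch size, gradient accumulation steps, and GPU count.

```python
# Illustrative numbers only; the target effective batch size comes from the
# training recipe being reproduced, not from this repository.
per_device_batch = 8          # largest size that fits in one GPU's memory
num_gpus = 2
target_effective_batch = 32   # hypothetical recipe value

# effective_batch = per_device_batch * grad_accum_steps * num_gpus
grad_accum_steps = target_effective_batch // (per_device_batch * num_gpus)
assert per_device_batch * grad_accum_steps * num_gpus == target_effective_batch
print(grad_accum_steps)  # -> 2
```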

Health Check

  • Last commit: 7 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 3
  • Star History: 33 stars in the last 90 days

Explore Similar Projects

Starred by Jeff Hammerbacher (Cofounder of Cloudera), Chip Huyen (Author of AI Engineering and Designing Machine Learning Systems), and 2 more.

hyena-dna by HazyResearch

Genomic foundation model for long-range DNA sequence modeling
704 stars · created 2 years ago · updated 3 months ago