Foundation model for multi-species genome analysis (ICLR 2024 paper)
DNABERT-2 provides an efficient foundation model and a comprehensive benchmark for multi-species genome understanding. It is designed for researchers and practitioners in bioinformatics and computational biology seeking state-of-the-art performance on diverse genomic tasks. The project offers a pre-trained model, a benchmark suite (GUE), and tools for fine-tuning on custom datasets.
How It Works
DNABERT-2 replaces traditional overlapping k-mer tokenization with Byte Pair Encoding (BPE) for a more efficient sequence representation. It also uses Attention with Linear Biases (ALiBi) in place of fixed positional embeddings, allowing better generalization to sequence lengths beyond those seen during training. These architectural improvements, combined with training on large-scale multi-species genomic data, enable state-of-the-art performance across a wide range of genomic tasks.
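Both ideas can be illustrated with a self-contained toy sketch (the merge vocabulary and ALiBi slope below are invented for illustration; they are not DNABERT-2's actual values):

```python
def kmer_tokenize(seq, k=3):
    """Traditional overlapping k-mer tokenization: every position yields a
    token, so a length-L sequence produces L-k+1 highly redundant tokens."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def bpe_tokenize(seq, vocab):
    """Greedy longest-match with a BPE-style vocabulary: frequent multi-base
    segments collapse into single tokens, shortening the input sequence."""
    tokens, i = [], 0
    while i < len(seq):
        for piece in sorted(vocab, key=len, reverse=True):
            if seq.startswith(piece, i):
                tokens.append(piece)
                i += len(piece)
                break
        else:  # unknown base falls back to a single-character token
            tokens.append(seq[i])
            i += 1
    return tokens

def alibi_bias(n, slope=0.5):
    """ALiBi adds a fixed linear penalty -slope * |i - j| to each attention
    score, so no learned positional embedding is needed and longer sequences
    than those seen in training still receive sensible biases."""
    return [[-slope * abs(i - j) for j in range(n)] for i in range(n)]

seq = "ATGCGATGCG"
print(kmer_tokenize(seq))                         # 8 overlapping 3-mers
print(bpe_tokenize(seq, {"ATGCG", "AT", "GC"}))   # 2 variable-length tokens
print(alibi_bias(3))
```

Note how BPE compresses the 10-base sequence into 2 tokens where k-mer tokenization produces 8, and how the ALiBi matrix depends only on token distance, not absolute position.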
Quick Start & Requirements
Install dependencies within a Python 3.8 virtual environment:
pip install -r requirements.txt
An optional Triton installation enables Flash Attention. The pre-trained model (zhihan1996/DNABERT-2-117M) is hosted on the Hugging Face Hub and loaded using the transformers library.
Highlighted Details
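A minimal loading sketch along the lines of the quick-start instructions (this assumes network access to the Hugging Face Hub; trust_remote_code is needed because the model ships custom architecture code):

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Download the pre-trained DNABERT-2 checkpoint from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained(
    "zhihan1996/DNABERT-2-117M", trust_remote_code=True
)
model = AutoModel.from_pretrained(
    "zhihan1996/DNABERT-2-117M", trust_remote_code=True
)

dna = "ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC"
inputs = tokenizer(dna, return_tensors="pt")["input_ids"]
hidden_states = model(inputs)[0]  # shape: [1, num_tokens, hidden_size]

# Mean-pool token embeddings into a single per-sequence embedding.
embedding = torch.mean(hidden_states[0], dim=0)
```

The per-sequence embedding can then be fed to a downstream classifier, or the model can be fine-tuned end-to-end on a GUE task or a custom dataset.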
Maintenance & Community
The project is associated with ICLR 2024 and its primary author is Zhihan Zhou. Issues can be raised on the GitHub repository, and direct contact is available via email.
Licensing & Compatibility
The repository does not explicitly state a license. The underlying models and code may be subject to the licenses of their respective dependencies (e.g., Hugging Face Transformers). Users should verify licensing for commercial or closed-source use.
Limitations & Caveats
The README does not specify a license, which may impact commercial adoption. The fine-tuning script requires careful adjustment of batch sizes and gradient accumulation steps based on available GPU resources to replicate reported results.