DNABERT: pre-trained BERT model for DNA-language analysis
DNABERT provides pre-trained transformer models for DNA sequence analysis, enabling researchers to leverage language modeling techniques for genomic tasks. It offers implementations for pre-training, fine-tuning, prediction, visualization, and genomic variant analysis, making it a comprehensive tool for computational biologists and bioinformaticians.
How It Works
DNABERT adapts the BERT architecture for DNA sequences by tokenizing them into k-mers. This approach treats DNA sequences as a "language," allowing the model to learn contextual embeddings and patterns. The pre-training phase captures general DNA language properties, while fine-tuning adapts the model to specific downstream tasks like classification or prediction.
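The k-mer preprocessing can be illustrated with a small helper (a hypothetical sketch, not code from the repository): for k = 6, a sequence such as ATCGGATTCA becomes five overlapping 6-mers that are treated as the "words" of the DNA language.

```python
# Minimal sketch of DNABERT-style k-mer tokenization (k is typically 3-6).
# This helper is illustrative only; the repository ships its own preprocessing.

def seq_to_kmers(sequence: str, k: int = 6) -> str:
    """Convert a DNA sequence into a space-separated string of overlapping k-mers."""
    sequence = sequence.upper()
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    return " ".join(kmers)

print(seq_to_kmers("ATCGGATTCA", k=6))
# -> ATCGGA TCGGAT CGGATT GGATTC GATTCA
```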
Quick Start & Requirements
Install the package with `pip install --editable .`, then run `pip install -r requirements.txt` within the `examples` directory.
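Once installed, a pre-trained checkpoint can be loaded through the Hugging Face Transformers API. The snippet below is a hedged sketch: the checkpoint name `zhihan1996/DNA_bert_6` is an assumption, so consult the DNABERT README for the officially released weights and the matching k-mer size.

```python
# Hedged sketch: loading a pre-trained DNABERT checkpoint with Hugging Face Transformers.
# The checkpoint name below is an assumption; check the repository for the exact
# model paths and the k-mer size each checkpoint expects.
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "zhihan1996/DNA_bert_6"  # assumed 6-mer checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Input must already be split into space-separated k-mers (see the helper above).
kmers = "ATCGGA TCGGAT CGGATT GGATTC GATTCA"
inputs = tokenizer(kmers, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, tokens, hidden_size)
```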
Maintenance & Community
The repository is under active development; a second generation, DNABERT-2, was released in June 2023. Users are encouraged to report issues.
Licensing & Compatibility
The repository does not explicitly state a license in the README. However, it builds on Hugging Face's Transformers, which is released under the Apache 2.0 license.
Limitations & Caveats
The original DNABERT model is limited to inputs of 512 tokens. The README notes that DNABERT-2 is more efficient and easier to use than this version. Installation can be complex, with specific CUDA and driver version requirements.