DNABERT by jerryji1993

DNABERT: pre-trained BERT model for DNA-language analysis

Created 5 years ago
750 stars

Top 46.2% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

DNABERT provides pre-trained transformer models for DNA sequence analysis, enabling researchers to leverage language modeling techniques for genomic tasks. It offers implementations for pre-training, fine-tuning, prediction, visualization, and genomic variant analysis, making it a comprehensive tool for computational biologists and bioinformaticians.

How It Works

DNABERT adapts the BERT architecture for DNA sequences by tokenizing them into k-mers. This approach treats DNA sequences as a "language," allowing the model to learn contextual embeddings and patterns. The pre-training phase captures general DNA language properties, while fine-tuning adapts the model to specific downstream tasks like classification or prediction.
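The k-mer tokenization described above can be sketched in a few lines of plain Python. This is a minimal illustration, not DNABERT's actual tokenizer; the helper name seq_to_kmers is hypothetical.

```python
def seq_to_kmers(seq: str, k: int = 6) -> list[str]:
    """Split a DNA sequence into overlapping k-mer 'words' (stride 1),
    the tokenization scheme DNABERT applies before feeding BERT.
    Hypothetical helper for illustration only."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

# A 10-bp sequence yields 10 - 6 + 1 = 5 overlapping 6-mers.
tokens = seq_to_kmers("ATGCTTCAGA", k=6)
print(" ".join(tokens))  # ATGCTT TGCTTC GCTTCA CTTCAG TTCAGA
```

Because consecutive tokens share k-1 bases, each nucleotide appears in up to k tokens, which is how the model picks up local sequence context.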

Quick Start & Requirements

  • Install: Clone the repository, create a conda environment with Python 3.6, and install PyTorch with CUDA 10.0 support. Then install the package with pip install --editable . and, from the examples directory, install the remaining dependencies with pip install -r requirements.txt.
  • Prerequisites: NVIDIA GPU with Driver Version >= 410.48 (CUDA 10.0 compatible).
  • Resources: Training was performed on 8 NVIDIA GeForce RTX 2080 Ti GPUs with 11GB memory. Adjust batch sizes for different hardware.
  • Links: DNABERT-2

Highlighted Details

  • Offers pre-trained models for k-mers 3, 4, 5, and 6.
  • Includes tools for visualization of attention scores and motif analysis.
  • Supports genomic variant analysis by predicting effects of mutations.
  • Extended from Hugging Face's Transformers library.
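The genomic variant analysis mentioned above works by comparing model predictions on a reference sequence against its mutated counterparts. The sketch below only generates the mutated inputs for such a comparison; the function name single_nucleotide_variants is an assumption, not part of DNABERT's API.

```python
def single_nucleotide_variants(seq: str) -> list[tuple[int, str, str, str]]:
    """Enumerate every single-nucleotide substitution of a DNA sequence.
    Each entry is (position, reference base, alternate base, mutated sequence).
    Illustrative helper; scoring the variants with DNABERT happens separately."""
    bases = "ACGT"
    variants = []
    for i, ref in enumerate(seq):
        for alt in bases:
            if alt != ref:
                variants.append((i, ref, alt, seq[:i] + alt + seq[i + 1:]))
    return variants

muts = single_nucleotide_variants("ACG")
print(len(muts))  # 9: three alternate bases at each of three positions
```

Feeding the reference and each mutated sequence through a fine-tuned model and comparing the prediction scores gives a per-variant effect estimate.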

Maintenance & Community

The repository is actively under development, with a second generation (DNABERT-2) released in June 2023. Users are encouraged to report issues.

Licensing & Compatibility

The repository does not explicitly state a license in the README. However, it is based on Hugging Face's Transformers, which is released under the Apache 2.0 license.

Limitations & Caveats

The original DNABERT model is limited to input sequences of 512 tokens. The README notes that DNABERT-2 is more efficient and easier to use than this first-generation model. Installation can also be complex, with specific CUDA and driver version requirements.
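A common workaround for the 512-token limit is to split long k-mer sequences into overlapping windows and aggregate the per-window predictions. This is a sketch of that idea under assumed parameters (the window_kmers helper, max_len, and overlap are illustrative; the real limit also has to leave room for special tokens like [CLS] and [SEP]).

```python
def window_kmers(tokens: list, max_len: int = 512, overlap: int = 64) -> list:
    """Split a long token list into overlapping windows of at most max_len
    tokens so each window fits DNABERT's input limit. Hypothetical helper;
    not part of the DNABERT codebase."""
    stride = max_len - overlap
    windows = []
    for start in range(0, len(tokens), stride):
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # the last window already covers the tail of the sequence
    return windows
```

Overlapping windows avoid cutting a motif exactly at a window boundary; predictions for the overlapping region can then be averaged.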

Health Check
Last Commit

2 months ago

Responsiveness

1 week

Pull Requests (30d)
0
Issues (30d)
0
Star History
7 stars in the last 30 days

Explore Similar Projects

Starred by Jeff Hammerbacher (Cofounder of Cloudera), Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems"), and 3 more.

hyena-dna by HazyResearch

0.4%
776
Genomic foundation model for long-range DNA sequence modeling
Created 2 years ago
Updated 11 months ago