DNABERT by jerryji1993

DNABERT: pre-trained BERT model for DNA-language analysis

Created 5 years ago
750 stars

Top 46.2% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

DNABERT provides pre-trained transformer models for DNA sequence analysis, enabling researchers to leverage language modeling techniques for genomic tasks. It offers implementations for pre-training, fine-tuning, prediction, visualization, and genomic variant analysis, making it a comprehensive tool for computational biologists and bioinformaticians.

How It Works

DNABERT adapts the BERT architecture for DNA sequences by tokenizing them into k-mers. This approach treats DNA sequences as a "language," allowing the model to learn contextual embeddings and patterns. The pre-training phase captures general DNA language properties, while fine-tuning adapts the model to specific downstream tasks like classification or prediction.
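The k-mer tokenization described above can be sketched in a few lines of plain Python. This is a minimal illustration, not DNABERT's actual tokenizer; the helper name seq_to_kmers is hypothetical.

```python
def seq_to_kmers(seq: str, k: int = 6) -> list[str]:
    """Split a DNA sequence into overlapping k-mer 'words' (stride 1),
    the tokenization scheme DNABERT applies before feeding BERT.
    Hypothetical helper for illustration only."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

# A 10-bp sequence yields 10 - 6 + 1 = 5 overlapping 6-mers.
tokens = seq_to_kmers("ATGCTTCAGA", k=6)
print(" ".join(tokens))  # ATGCTT TGCTTC GCTTCA CTTCAG TTCAGA
```

Because consecutive tokens share k-1 bases, each nucleotide appears in up to k tokens, which is how the model picks up local sequence context.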

Quick Start & Requirements

  • Install: Clone the repository, create a conda environment with Python 3.6, and install PyTorch with CUDA 10.0 support. Then install the package with pip install --editable . and, from the examples directory, install the remaining dependencies with pip install -r requirements.txt.
  • Prerequisites: NVIDIA GPU with Driver Version >= 410.48 (CUDA 10.0 compatible).
  • Resources: Training was performed on 8 NVIDIA GeForce RTX 2080 Ti GPUs with 11GB memory. Adjust batch sizes for different hardware.
  • Links: DNABERT-2

Highlighted Details

  • Offers pre-trained models for k-mers 3, 4, 5, and 6.
  • Includes tools for visualization of attention scores and motif analysis.
  • Supports genomic variant analysis by predicting effects of mutations.
  • Extended from Hugging Face's Transformers library.
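The genomic variant analysis mentioned above works by comparing model predictions on a reference sequence against its mutated counterparts. The sketch below only generates the mutated inputs for such a comparison; the function name single_nucleotide_variants is an assumption, not part of DNABERT's API.

```python
def single_nucleotide_variants(seq: str) -> list[tuple[int, str, str, str]]:
    """Enumerate every single-nucleotide substitution of a DNA sequence.
    Each entry is (position, reference base, alternate base, mutated sequence).
    Illustrative helper; scoring the variants with DNABERT happens separately."""
    bases = "ACGT"
    variants = []
    for i, ref in enumerate(seq):
        for alt in bases:
            if alt != ref:
                variants.append((i, ref, alt, seq[:i] + alt + seq[i + 1:]))
    return variants

muts = single_nucleotide_variants("ACG")
print(len(muts))  # 9: three alternate bases at each of three positions
```

Feeding the reference and each mutated sequence through a fine-tuned model and comparing the prediction scores gives a per-variant effect estimate.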

Maintenance & Community

The repository is actively under development, with a second generation (DNABERT-2) released in June 2023. Users are encouraged to report issues.

Licensing & Compatibility

The repository does not explicitly state a license in the README. However, it is based on Hugging Face's Transformers, which is released under the Apache 2.0 license.

Limitations & Caveats

The original DNABERT model is limited to input sequences of 512 tokens. The README notes that DNABERT-2 is more efficient and easier to use than this first-generation model. Installation can also be complex, with specific CUDA and driver version requirements.
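A common workaround for the 512-token limit is to split long k-mer sequences into overlapping windows and aggregate the per-window predictions. This is a sketch of that idea under assumed parameters (the window_kmers helper, max_len, and overlap are illustrative; the real limit also has to leave room for special tokens like [CLS] and [SEP]).

```python
def window_kmers(tokens: list, max_len: int = 512, overlap: int = 64) -> list:
    """Split a long token list into overlapping windows of at most max_len
    tokens so each window fits DNABERT's input limit. Hypothetical helper;
    not part of the DNABERT codebase."""
    stride = max_len - overlap
    windows = []
    for start in range(0, len(tokens), stride):
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # the last window already covers the tail of the sequence
    return windows
```

Overlapping windows avoid cutting a motif exactly at a window boundary; predictions for the overlapping region can then be averaged.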

Health Check
Last Commit

2 months ago

Responsiveness

1 week

Pull Requests (30d)
0
Issues (30d)
0
Star History
7 stars in the last 30 days

Explore Similar Projects

Starred by Jeff Hammerbacher (Cofounder of Cloudera), Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems"), and 3 more.

hyena-dna by HazyResearch

0.4%
776
Genomic foundation model for long-range DNA sequence modeling
Created 2 years ago
Updated 11 months ago