DNABERT by jerryji1993

DNABERT: pre-trained BERT model for DNA-language analysis

Created 5 years ago
695 stars

Top 49.0% on SourcePulse

Project Summary

DNABERT provides pre-trained transformer models for DNA sequence analysis, enabling researchers to leverage language modeling techniques for genomic tasks. It offers implementations for pre-training, fine-tuning, prediction, visualization, and genomic variant analysis, making it a comprehensive tool for computational biologists and bioinformaticians.

How It Works

DNABERT adapts the BERT architecture for DNA sequences by tokenizing them into k-mers. This approach treats DNA sequences as a "language," allowing the model to learn contextual embeddings and patterns. The pre-training phase captures general DNA language properties, while fine-tuning adapts the model to specific downstream tasks like classification or prediction.
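
To make the tokenization concrete, here is a minimal sketch of converting a raw sequence into the whitespace-separated, overlapping k-mers DNABERT consumes; the function name is illustrative, not taken from the repository:

```python
def seq_to_kmers(seq: str, k: int = 6) -> str:
    """Slide a window of size k across the sequence and join the
    overlapping k-mers with spaces, the input format DNABERT expects."""
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))

# "ATGGCTA" yields two overlapping 6-mers:
print(seq_to_kmers("ATGGCTA"))  # -> "ATGGCT TGGCTA"
```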

Quick Start & Requirements

  • Install: Clone the repository, create a conda environment with Python 3.6, and install PyTorch built for CUDA 10.0. Then install the package with pip install --editable . from the repository root, and its dependencies with pip install -r requirements.txt from the examples directory (a minimal loading sketch follows this list).
  • Prerequisites: NVIDIA GPU with Driver Version >= 410.48 (CUDA 10.0 compatible).
  • Resources: Training was performed on 8 NVIDIA GeForce RTX 2080 Ti GPUs (11 GB memory each); adjust batch sizes for different hardware.
  • Links: DNABERT-2
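
Once installed, a checkpoint can be loaded through the standard Transformers API. The sketch below assumes a 6-mer checkpoint is available locally or on the Hugging Face Hub; the zhihan1996/DNA_bert_6 identifier is a community-published mirror, not something named in this README:

```python
from transformers import AutoModel, AutoTokenizer

# Assumed checkpoint identifier; substitute a local path to a
# checkpoint downloaded per the README if preferred.
name = "zhihan1996/DNA_bert_6"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

# DNABERT expects space-separated overlapping k-mers as input.
inputs = tokenizer("ATGGCT TGGCTA GGCTAC", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, num_tokens, hidden_size)
```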

Highlighted Details

  • Offers pre-trained models for k-mers 3, 4, 5, and 6.
  • Includes tools for visualizing attention scores and for motif analysis (see the attention sketch after this list).
  • Supports genomic variant analysis by predicting effects of mutations.
  • Built as an extension of Hugging Face's Transformers library.
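
The repository's visualization workflow centers on per-token attention. As a rough illustration of how such scores can be pulled from a BERT-style model with the Transformers API (a generic sketch, not the repo's own script, reusing the assumed checkpoint name from above):

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "zhihan1996/DNA_bert_6"  # assumed checkpoint identifier
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_attentions=True)

inputs = tokenizer("ATGGCT TGGCTA GGCTAC", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions holds one tensor per layer, each shaped
# (batch, heads, tokens, tokens). Average the last layer's heads
# for a simple per-token attention map.
attn = outputs.attentions[-1].mean(dim=1).squeeze(0)
print(attn.shape)
```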

Maintenance & Community

The repository is actively under development, with a second generation (DNABERT-2) released in June 2023. Users are encouraged to report issues.

Licensing & Compatibility

The repository does not explicitly state a license in its README. It is, however, built on Hugging Face's Transformers, which is released under the Apache 2.0 license.

Limitations & Caveats

The original DNABERT model accepts input sequences of at most 512 tokens, so longer sequences must be truncated or windowed. The README notes that DNABERT-2 is more efficient and easier to use, suggesting improvements over this version. Installation can also be complex, given the specific CUDA and driver version requirements.
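
As a common workaround for the 512-token limit, long inputs are split into overlapping windows and scored separately. A minimal sketch follows; the helper and its defaults are illustrative, not from the repository:

```python
def window_kmers(kmers, max_len=510, stride=255):
    """Split a k-mer list into overlapping windows of at most max_len
    tokens, leaving room for [CLS] and [SEP] in a 512-token budget."""
    windows = []
    for start in range(0, len(kmers), stride):
        windows.append(kmers[start:start + max_len])
        if start + max_len >= len(kmers):
            break
    return windows

# A 1,000-k-mer sequence becomes three overlapping windows.
print([len(w) for w in window_kmers(["A"] * 1000)])  # [510, 510, 490]
```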

Health Check

  • Last Commit: 2 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 11 stars in the last 30 days

Explore Similar Projects

Starred by Jeff Hammerbacher (cofounder of Cloudera), Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), and 3 more.

hyena-dna by HazyResearch

  • Top 0.3% on SourcePulse
  • 719 stars
  • Genomic foundation model for long-range DNA sequence modeling
  • Created 2 years ago, updated 4 months ago