DNABERT by jerryji1993

DNABERT: pre-trained BERT model for DNA-language analysis

created 5 years ago
682 stars

Top 50.7% on sourcepulse

View on GitHub
1 Expert Loves This Project
Project Summary

DNABERT provides pre-trained transformer models for DNA sequence analysis, enabling researchers to leverage language modeling techniques for genomic tasks. It offers implementations for pre-training, fine-tuning, prediction, visualization, and genomic variant analysis, making it a comprehensive tool for computational biologists and bioinformaticians.

How It Works

DNABERT adapts the BERT architecture for DNA sequences by tokenizing them into overlapping k-mers. This approach treats DNA as a "language," allowing the model to learn contextual embeddings and sequence patterns. The pre-training phase captures general properties of genomic sequence, while fine-tuning adapts the model to specific downstream tasks such as sequence classification or variant-effect prediction.
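
To make the tokenization concrete, here is a minimal sketch of the overlapping k-mer scheme (the helper below is our own illustration, not code from the repository):

```python
def seq_to_kmers(seq: str, k: int = 6) -> str:
    """Split a DNA sequence into overlapping k-mers, the whitespace-separated
    token format that k-mer BERT models consume."""
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))

# An 8-bp sequence yields three overlapping 6-mers:
print(seq_to_kmers("ATCGTTCA"))  # -> "ATCGTT TCGTTC CGTTCA"
```

Each k-mer overlaps its neighbor by k - 1 bases, so a sequence of length n produces n - k + 1 tokens.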

Quick Start & Requirements

  • Install: Clone the repository, create a conda environment with Python 3.6, and install PyTorch with CUDA 10.0 support. Then install the package with pip install --editable . from the repository root, and its dependencies with pip install -r requirements.txt from the examples directory (a minimal model-loading sketch follows this list).
  • Prerequisites: NVIDIA GPU with Driver Version >= 410.48 (CUDA 10.0 compatible).
  • Resources: Training was performed on 8 NVIDIA GeForce RTX 2080 Ti GPUs with 11GB memory. Adjust batch sizes for different hardware.
  • Links: DNABERT-2
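
Once the environment is set up, loading a checkpoint follows the familiar Hugging Face pattern. The sketch below is an assumption-laden illustration: it uses the stock transformers class names and a placeholder checkpoint path, while DNABERT's bundled fork ships its own DNA-specific classes, so consult the repo's example scripts for the exact names.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Placeholder: point this at an unpacked pre-trained DNABERT checkpoint.
model_path = "path/to/dnabert6"

tokenizer = BertTokenizer.from_pretrained(model_path)
model = BertForSequenceClassification.from_pretrained(model_path)
model.eval()

# Input must already be whitespace-separated k-mers (see seq_to_kmers above).
inputs = tokenizer("ATCGTT TCGTTC CGTTCA", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits)
```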

Highlighted Details

  • Offers pre-trained models for k-mer sizes 3, 4, 5, and 6.
  • Includes tools for visualizing attention scores and performing motif analysis (a generic attention-extraction sketch follows this list).
  • Supports genomic variant analysis by predicting the effects of mutations.
  • Built as an extension of Hugging Face's Transformers library.
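
As a rough illustration of the attention-visualization idea (a generic recipe, not the repository's own script; the checkpoint path is a placeholder), per-k-mer attention scores can be read out of any BERT-style model:

```python
import torch
from transformers import BertModel, BertTokenizer

model_path = "path/to/dnabert6"  # placeholder checkpoint path
tokenizer = BertTokenizer.from_pretrained(model_path)
model = BertModel.from_pretrained(model_path, output_attentions=True)
model.eval()

inputs = tokenizer("ATCGTT TCGTTC CGTTCA", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# out.attentions holds one (batch, heads, seq, seq) tensor per layer.
# Average the last layer's heads and read attention from [CLS] to each
# token as a crude per-k-mer importance score.
last_layer = out.attentions[-1].mean(dim=1)   # (batch, seq, seq)
cls_attention = last_layer[0, 0]              # row for the [CLS] token
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for tok, score in zip(tokens, cls_attention.tolist()):
    print(f"{tok}\t{score:.3f}")
```

High-scoring contiguous k-mers are candidate motif regions, which is broadly the intuition behind attention-based motif analysis.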

Maintenance & Community

The repository is actively under development, with a second generation (DNABERT-2) released in June 2023. Users are encouraged to report issues.

Licensing & Compatibility

The repository does not explicitly state a license in its README. It is built on Hugging Face's Transformers, which is released under the Apache 2.0 license, but that license does not automatically cover DNABERT's own code.

Limitations & Caveats

The original DNABERT model is limited to inputs of 512 tokens (a sliding-window workaround is sketched below). The README notes that DNABERT-2 is more efficient and easier to use, suggesting it improves on this version. Installation can also be fiddly, given the specific CUDA and driver version requirements.
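
One common, generic workaround for the 512-token cap (not a feature DNABERT itself provides) is to score long sequences in overlapping windows and aggregate the per-window results:

```python
def token_windows(tokens, max_len=510, stride=255):
    """Yield overlapping windows of k-mer tokens that fit a 512-token
    BERT input (510 content tokens plus [CLS] and [SEP])."""
    start = 0
    while True:
        yield tokens[start:start + max_len]
        if start + max_len >= len(tokens):
            break
        start += stride
```

Per-window scores can then be combined (for example, by max- or mean-pooling) depending on the task.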

Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 2
  • Star History: 28 stars in the last 90 days

Explore Similar Projects

Starred by Jeff Hammerbacher (Cofounder of Cloudera), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 2 more.

hyena-dna by HazyResearch

704 stars
Genomic foundation model for long-range DNA sequence modeling
created 2 years ago
updated 3 months ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 1 more.

BioGPT by microsoft

4k stars
BioGPT is a generative pre-trained transformer for biomedical text
created 3 years ago
updated 1 year ago