nucleotide-transformer  by instadeepai

Genomics foundation models & segmentation tools

created 2 years ago
672 stars

Top 51.2% on sourcepulse

Project Summary

This repository provides foundation models for genomics and transcriptomics, including the Nucleotide Transformer (NT) and Agro Nucleotide Transformer (AgroNT) for genomic language modeling, and SegmentNT, SegmentEnformer, and SegmentBorzoi for single-nucleotide resolution genomic element segmentation. It targets researchers and practitioners in bioinformatics and computational biology, offering pre-trained weights and inference code to accelerate genomic analysis and discovery.

How It Works

The models are transformer architectures adapted to DNA sequences. Nucleotide Transformer models tokenize DNA into 6-mers and are pre-trained at large scale on human and multi-species genomes. SegmentNT models build on these transformers by replacing the language-model head with a U-Net segmentation head, which localizes genomic elements at single-nucleotide resolution. Nucleotide Transformer v2 models add architectural changes such as rotary embeddings and gated linear units, which improve efficiency and allow longer context windows.
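To make the 6-mer tokenization step concrete, here is a minimal sketch in plain Python. The function name `kmer_tokenize` is hypothetical and is not the repository's tokenizer API. The handling of trailing bases (emitting them as single-nucleotide tokens) is an assumption made for illustration.

```python
from typing import List

K = 6  # k-mer size named in the text


def kmer_tokenize(sequence: str, k: int = K) -> List[str]:
    """Split a DNA string into non-overlapping k-mers, left to right.

    Assumption (not confirmed by this summary): bases left over after the
    last full k-mer are emitted as single-nucleotide tokens.
    """
    sequence = sequence.upper()
    n_full = len(sequence) // k
    # Full, non-overlapping k-mers.
    tokens = [sequence[i * k:(i + 1) * k] for i in range(n_full)]
    # Leftover bases, one token each (extend iterates over the characters).
    tokens.extend(sequence[n_full * k:])
    return tokens


tokens = kmer_tokenize("ATTCCGATTCCGAT")
print(tokens)  # ['ATTCCG', 'ATTCCG', 'A', 'T']
```

A real tokenizer would also map each token to an integer ID and add special tokens; those steps are omitted here.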

Quick Start & Requirements

  • Install from a local clone by running pip install . in the repository root.
  • Requires JAX, which supports both GPU and TPU.
  • Example notebooks are available on HuggingFace Spaces for fine-tuning and inference.

Highlighted Details

  • Offers 9 Nucleotide Transformer models and 2 segmentation models with pre-trained weights.
  • Nucleotide Transformers are trained on over 3,200 human genomes and on genomes from 850 species.
  • SegmentNT models demonstrate zero-shot generalization to sequences of up to 50 kbp and achieve state-of-the-art performance.
  • Nucleotide Transformer v2 models support a context window of up to 12 kbp.
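The sequence lengths above can be related to token counts with simple arithmetic. The sketch below assumes 1 kbp = 1,000 bases and ignores any special tokens a model may add, so the numbers are rough estimates and not the models' exact limits. The helper name `split_into_kmers` is hypothetical.

```python
from typing import Tuple

K = 6  # bases per token, from the 6-mer tokenization described in this summary


def split_into_kmers(n_bp: int, k: int = K) -> Tuple[int, int]:
    """Return (number of full k-mers, leftover bases) for n_bp bases."""
    return divmod(n_bp, k)


# Assumption: 1 kbp = 1,000 bp; special tokens are not counted.
for label, n_bp in [("12 kbp context window", 12_000),
                    ("50 kbp zero-shot length", 50_000)]:
    full, leftover = split_into_kmers(n_bp)
    print(f"{label}: {full} full 6-mers, {leftover} leftover base(s)")
```

Under these assumptions, 12,000 bases correspond to 2,000 6-mer tokens, while 50,000 bases do not divide evenly by 6 and leave 2 bases over.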

Maintenance & Community

  • Developed in collaboration with Nvidia, TUM, and Google.
  • Associated papers are available for citation.
  • Contact information is provided for questions and feedback.

Licensing & Compatibility

  • The repository does not explicitly state a license in the README.

Limitations & Caveats

  • SegmentNT models cannot process sequences containing "N" bases due to the 6-mer tokenization requirement.
  • Sequences whose length is not a multiple of 6 require a specific tokenization strategy.
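A pre-flight check for the two caveats above might look like the following sketch. Both helpers (`find_input_problems`, `trim_to_multiple`) are hypothetical names and are not part of the repository. Trimming trailing bases is only one possible strategy, shown as an assumption and not as the approach the project recommends.

```python
from typing import List


def find_input_problems(sequence: str, k: int = 6) -> List[str]:
    """List reasons a sequence may be unsuitable for 6-mer-only tokenization."""
    seq = sequence.upper()
    problems = []
    n_count = seq.count("N")
    if n_count:
        problems.append(f"contains {n_count} 'N' base(s)")
    remainder = len(seq) % k
    if remainder:
        problems.append(
            f"length {len(seq)} leaves {remainder} base(s) after the last full {k}-mer"
        )
    return problems


def trim_to_multiple(sequence: str, k: int = 6) -> str:
    """Drop trailing bases so the length is a multiple of k (one possible strategy)."""
    return sequence[: len(sequence) - len(sequence) % k]


print(find_input_problems("ACGTNAC"))
print(trim_to_multiple("ATTCCGATTCCGAT"))  # ATTCCGATTCCG
```

Trimming discards data, so padding or cropping at a biologically meaningful boundary may be preferable depending on the task.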

Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: 1+ week
  • Pull requests (30d): 2
  • Issues (30d): 2
  • Star history: 74 stars in the last 90 days
