nucleotide-transformer  by instadeepai

Genomics foundation models & segmentation tools

Created 2 years ago
722 stars

Top 47.7% on SourcePulse

Project Summary

This repository provides foundation models for genomics and transcriptomics, including the Nucleotide Transformer (NT) and Agro Nucleotide Transformer (AgroNT) for genomic language modeling, and SegmentNT, SegmentEnformer, and SegmentBorzoi for single-nucleotide resolution genomic element segmentation. It targets researchers and practitioners in bioinformatics and computational biology, offering pre-trained weights and inference code to accelerate genomic analysis and discovery.

How It Works

The core of the project utilizes transformer architectures adapted for DNA sequences. Nucleotide Transformers process DNA by tokenizing sequences into 6-mers, leveraging large-scale pre-training on diverse human and multi-species genomes. SegmentNT models build upon these transformers by replacing the language model head with a U-Net segmentation head, enabling precise localization of genomic features. Nucleotide Transformer v2 models incorporate architectural improvements like Rotary Embeddings and Gated Linear Units for enhanced efficiency and longer context windows.
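
The 6-mer tokenization described above can be sketched in a few lines of Python. This is an illustrative sketch, not the repository's tokenizer: the function name tokenize_kmers is hypothetical, and the fallback to single-base tokens for leftover bases or windows containing "N" is an assumption about how such cases are handled.

```python
from typing import List

K = 6  # k-mer size stated in the summary


def tokenize_kmers(sequence: str, k: int = K) -> List[str]:
    """Split a DNA string into non-overlapping k-mers, left to right.

    Assumed fallback (not confirmed by the summary): when the next k
    bases are incomplete or contain a non-ACGT character such as "N",
    the current base is emitted as a single-character token instead.
    """
    sequence = sequence.upper()
    tokens: List[str] = []
    i = 0
    while i < len(sequence):
        window = sequence[i:i + k]
        if len(window) == k and set(window) <= set("ACGT"):
            tokens.append(window)  # a clean, full k-mer
            i += k
        else:
            tokens.append(sequence[i])  # single-base fallback
            i += 1
    return tokens


print(tokenize_kmers("ATGCGTACGTTAGC"))  # ['ATGCGT', 'ACGTTA', 'G', 'C']
```

Under this assumption, a 14-base sequence yields two 6-mer tokens followed by two single-base tokens, which is why sequence length and "N" content affect the token count.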

Quick Start & Requirements

  • Install by cloning the repository and running pip install . from its root directory.
  • Requires JAX, which supports both GPU and TPU.
  • Example notebooks are available on HuggingFace Spaces for fine-tuning and inference.

Highlighted Details

  • Offers 9 Nucleotide Transformer models and 2 segmentation models with pre-trained weights.
  • Nucleotide Transformers are trained on over 3,200 human genomes and 850 species genomes.
  • SegmentNT models generalize zero-shot to sequences of up to 50kbp and are reported to achieve state-of-the-art segmentation performance.
  • Nucleotide Transformer v2 models support a context window of up to 12kbp.
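
To relate the context sizes quoted above to token counts, the arithmetic can be written out as a small helper. This is a rough sketch: token_count is a hypothetical name, and it assumes an N-free sequence in which each full 6-mer costs one token and each leftover base costs one token. Special tokens such as a class token are not counted.

```python
def token_count(num_bases: int, k: int = 6) -> int:
    """Estimate tokens for an N-free sequence of num_bases nucleotides.

    Assumption: full k-mers take one token each, and any leftover
    bases (num_bases % k) take one token each.
    """
    if num_bases < 0:
        raise ValueError("num_bases must be non-negative")
    full_kmers, leftover = divmod(num_bases, k)
    return full_kmers + leftover


print(token_count(12_000))  # 2000
print(token_count(12_005))  # 2005
```

Under these assumptions a 12kbp input maps to about 2,000 tokens, and a length that is not a multiple of 6 costs one extra token per leftover base.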

Maintenance & Community

  • Developed in collaboration with Nvidia, TUM, and Google.
  • Associated papers are available for citation.
  • Contact information is provided for questions and feedback.

Licensing & Compatibility

  • The repository does not explicitly state a license in the README.

Limitations & Caveats

  • SegmentNT models cannot process sequences containing "N" bases due to the 6-mer tokenization requirement.
  • Sequences whose length is not a multiple of 6 cannot be split entirely into 6-mers, so the leftover bases require a specific tokenization strategy.
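
The two caveats above suggest checking inputs before passing them to a segmentation model. The following is a minimal sketch of such a pre-check. The names SequenceCheck, check_sequence, and trim_to_multiple are hypothetical and not part of the repository, and trimming trailing bases is only one possible strategy, chosen here for simplicity.

```python
from dataclasses import dataclass


@dataclass
class SequenceCheck:
    has_n: bool     # True if the sequence contains at least one "N"
    remainder: int  # bases left over after the last full 6-mer

    @property
    def ok(self) -> bool:
        """True when the sequence is N-free and a whole number of 6-mers."""
        return not self.has_n and self.remainder == 0


def check_sequence(sequence: str, k: int = 6) -> SequenceCheck:
    """Report the two conditions named in the caveats above."""
    seq = sequence.upper()
    return SequenceCheck(has_n="N" in seq, remainder=len(seq) % k)


def trim_to_multiple(sequence: str, k: int = 6) -> str:
    """Drop trailing bases so the length is a multiple of k (one assumed strategy)."""
    return sequence[: len(sequence) - len(sequence) % k]


report = check_sequence("ACGTACG")
print(report.ok, report.remainder)  # False 1
print(trim_to_multiple("ACGTACG"))  # ACGTAC
```

Trimming discards data at the end of the sequence, so whether it is acceptable depends on the analysis. Sequences containing "N" still need to be filtered or split upstream.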

Health Check

  • Last commit: 3 months ago
  • Responsiveness: 1+ week
  • Pull requests (30d): 1
  • Issues (30d): 1
  • Star history: 17 stars in the last 30 days
