gatk  by broadinstitute

GATK is a genome analysis toolkit

created 10 years ago
1,823 stars

Top 24.2% on sourcepulse

Project Summary

The Genome Analysis Toolkit (GATK) is a suite of tools for variant discovery and genotyping. It is aimed at researchers and bioinformaticians who analyze large-scale DNA and RNA sequencing datasets. Selected GATK4 tools use Apache Spark for parallel processing, so they can run on clusters or cloud platforms.

How It Works

GATK4 is built on a unified framework that integrates established tools from GATK and Picard and adds new, specialized tools. Selected tools use Apache Spark for distributed computing, which lets them run in a massively parallel fashion and improves performance and scalability on large genomic datasets.

Quick Start & Requirements

  • Installation: Pre-built executables are available via the GATK website. Building from source requires a Java 17 JDK, Git 2.5+, git-lfs, and Gradle 5.6.
  • Dependencies: Java 17; Python 3.10.13 (via Conda) for the gatk frontend script and certain tools; R 4.3.1 for plotting. Git LFS is needed to download roughly 5GB of large test data.
  • Running: Use the ./gatk script. For Spark tools, pass the --spark-runner and --spark-master arguments after a standalone -- separator that follows the tool's own arguments.
  • Documentation: GATK Website
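
The Running step above can be sketched as a small helper that assembles the argument list for the ./gatk script. This is a minimal illustration, not part of GATK: the function name build_gatk_command and the BAM file names are hypothetical, and the sketch assumes Spark arguments are placed after a standalone -- separator. PrintReadsSpark is used as the example Spark tool, and local[2] is ordinary Spark master syntax for two local threads.

```python
import shlex
from typing import Optional, Sequence


def build_gatk_command(
    tool: str,
    tool_args: Sequence[str],
    spark_runner: Optional[str] = None,
    spark_master: Optional[str] = None,
    launcher: str = "./gatk",
) -> list[str]:
    """Assemble an argument list for the gatk launcher (hypothetical helper)."""
    # Tool name and the tool's own arguments come first.
    cmd = [launcher, tool, *tool_args]

    # Spark-specific arguments are collected separately.
    spark_args: list[str] = []
    if spark_runner is not None:
        spark_args += ["--spark-runner", spark_runner]
    if spark_master is not None:
        spark_args += ["--spark-master", spark_master]

    # Only add the "--" separator when there is something to put after it.
    if spark_args:
        cmd += ["--", *spark_args]
    return cmd


if __name__ == "__main__":
    # File names are placeholders for illustration only.
    cmd = build_gatk_command(
        "PrintReadsSpark",
        ["-I", "input.bam", "-O", "output.bam"],
        spark_runner="LOCAL",
        spark_master="local[2]",
    )
    # shlex.join quotes arguments so the line can be pasted into a shell.
    print(shlex.join(cmd))
```

Returning a list rather than a single string means the result can be handed to subprocess.run without shell quoting concerns; the printed form is only for inspection.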

Highlighted Details

  • Supports running Spark tools locally, on Spark clusters, or on Google Cloud Dataproc.
  • Includes pre-packaged bioinformatics tools (bedtools, samtools, bcftools) and Python/R packages within its Docker image.
  • Offers bash tab completion (beta) for command-line usability.
  • Provides detailed developer guidelines, including testing strategies and contribution protocols.

Maintenance & Community

  • The project is actively maintained by the Broad Institute.
  • Discussions and support are available via the GATK Forum.
  • Issue tracking is managed via the Issue Tracker.

Licensing & Compatibility

  • Licensed under the Apache 2.0 License.
  • Permits commercial use and linking with closed-source software.

Limitations & Caveats

  • Working from source involves a substantial Git LFS download (~5GB of large test data).
  • Some features, like bash tab completion, are marked as beta.
  • Running cloud tests requires specific Google Cloud setup and credentials.
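
Before attempting a source build, it can help to confirm that the command-line prerequisites listed under Quick Start & Requirements are on the PATH. The following is a rough sketch under stated assumptions: the grouping into required and optional executables is an illustrative choice, the helper name find_missing is hypothetical, and the script checks only for presence, not for the specific versions (Java 17, Git 2.5+, Python 3.10.13, R 4.3.1).

```python
import shutil

# Executable names derived from the requirements list. Treating conda and
# Rscript as optional is an assumption made for this sketch.
REQUIRED_EXECUTABLES = ("java", "git", "git-lfs")
OPTIONAL_EXECUTABLES = ("conda", "Rscript")


def find_missing(executables):
    """Return the names from `executables` that are not found on PATH."""
    return [name for name in executables if shutil.which(name) is None]


if __name__ == "__main__":
    missing_required = find_missing(REQUIRED_EXECUTABLES)
    missing_optional = find_missing(OPTIONAL_EXECUTABLES)

    if missing_required:
        print("Missing required tools:", ", ".join(missing_required))
    if missing_optional:
        print("Missing optional tools:", ", ".join(missing_optional))
    if not missing_required and not missing_optional:
        print("All listed tools were found on PATH (versions not checked).")
```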

Health Check

  • Last commit: 1 day ago
  • Responsiveness: 1 week
  • Pull requests (30d): 23
  • Issues (30d): 4

Star History

  • 32 stars in the last 90 days
