The Genome Analysis Toolkit (GATK) is a comprehensive suite of tools for variant discovery and genotyping. It is designed for researchers and bioinformaticians working with large-scale genomic datasets, offering robust and scalable solutions for DNA and RNA sequencing data analysis. GATK4 leverages Apache Spark for parallel processing, enabling efficient analysis on clusters or cloud platforms.
How It Works
GATK4 combines established tools from earlier GATK releases and Picard in a single framework and adds new, specialized tools. Selected tools use Apache Spark for distributed computing, so they can run in parallel across many cores or nodes, which improves performance and scalability on large genomic datasets.
Quick Start & Requirements
- Installation: Pre-built executables are available via the GATK website. Building from source requires the Java 17 JDK, Git 2.5+, git-lfs, and Gradle 5.6. Python 3.10.13 with Conda is needed for the gatk frontend script and certain tools. R 4.3.1 is required for plotting.
- Dependencies: Java 17, Python 3.10.13 (via Conda), R 4.3.1. Git LFS is necessary for downloading ~5GB of large test data.
- Running: Use the ./gatk script. For Spark tools, use the --spark-runner and --spark-master arguments.
- Documentation: GATK Website
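The prerequisites listed above can be checked before attempting a source build. The following is a minimal sketch, not part of GATK: the helper name find_missing_tools and the executable names (java, git, git-lfs, conda, Rscript) are assumptions about a typical installation, and the check only confirms that each executable is on PATH, not that its version matches.

```python
import shutil

# Executable names are assumptions about a typical install; adjust as needed.
REQUIRED_TOOLS = {
    "java": "Java 17 JDK",
    "git": "Git 2.5+",
    "git-lfs": "git-lfs (large test data)",
    "conda": "Conda (Python 3.10.13 environment)",
    "Rscript": "R 4.3.1 (plotting)",
}


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return descriptions of tools whose executables are not found on PATH.

    Hypothetical helper: it checks presence only, not version numbers.
    """
    return [desc for exe, desc in tools.items() if which(exe) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing prerequisites:")
        for desc in missing:
            print("  -", desc)
    else:
        print("All listed executables were found on PATH.")
```

The lookup function is passed in as a parameter so the check can be exercised without touching the real PATH.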
Highlighted Details
- Supports running Spark tools locally, on Spark clusters, or on Google Cloud Dataproc.
- Includes pre-packaged bioinformatics tools (bedtools, samtools, bcftools) and Python/R packages within its Docker image.
- Offers bash tab completion (beta) for command-line usability.
- Provides detailed developer guidelines, including testing strategies and contribution protocols.
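The three Spark run modes in the list above differ mainly in the arguments passed after the tool's own arguments. The sketch below assembles such command lines without running them. It rests on assumptions that should be checked against the GATK documentation: build_spark_command is a hypothetical helper, the runner values LOCAL, SPARK, and GCS, the --cluster flag, and the "--" separator are recalled from the GATK README rather than stated in this text, and the file paths, master URL, and cluster name are placeholders.

```python
import shlex


def build_spark_command(tool, tool_args, runner="LOCAL", master=None,
                        cluster=None, gatk="./gatk"):
    """Assemble an argv list for a GATK Spark tool (hypothetical helper)."""
    spark_args = ["--spark-runner", runner]
    if master is not None:
        spark_args += ["--spark-master", master]
    if cluster is not None:
        # Assumed flag for naming a Dataproc cluster; verify before use.
        spark_args += ["--cluster", cluster]
    # Tool arguments come first; "--" separates them from the Spark arguments.
    return [gatk, tool, *tool_args, "--", *spark_args]


if __name__ == "__main__":
    io_args = ["-I", "input.bam", "-O", "output.bam"]  # placeholder paths
    examples = [
        # Local run, assumed to use two threads
        build_spark_command("PrintReadsSpark", io_args, master="local[2]"),
        # Existing Spark cluster (placeholder master URL)
        build_spark_command("PrintReadsSpark", io_args, runner="SPARK",
                            master="spark://host:7077"),
        # Google Cloud Dataproc (placeholder cluster name)
        build_spark_command("PrintReadsSpark", io_args, runner="GCS",
                            cluster="my-dataproc-cluster"),
    ]
    for argv in examples:
        print(shlex.join(argv))
```

Returning a list rather than a single string means the result can be handed to subprocess.run unchanged, which avoids shell quoting problems with values such as local[2].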
Maintenance & Community
- The project is actively maintained by the Broad Institute.
- Discussions and support are available via the GATK Forum.
- Bugs and feature requests are filed in the project's Issue Tracker.
Licensing & Compatibility
- Licensed under the Apache 2.0 License.
- Permits commercial use and linking with closed-source software.
Limitations & Caveats
- Building from source requires a substantial download (~5GB) for test data via Git LFS.
- Some features, like bash tab completion, are marked as beta.
- Running cloud tests requires specific Google Cloud setup and credentials.