bakta by oschwengers

CLI tool for rapid bacterial genome annotation

created 5 years ago
538 stars

Top 59.8% on sourcepulse

Project Summary

Bakta is a command-line tool for rapid and standardized annotation of bacterial genomes, metagenome-assembled genomes (MAGs), and plasmids. It targets bioinformaticians and researchers needing high-quality, machine-readable annotations that adhere to FAIR principles, facilitating downstream analysis and submission to public databases.

How It Works

Bakta employs an alignment-free sequence identification (AFSI) approach: it computes MD5 hash digests of protein sequences and looks them up in a comprehensive database derived from UniProt to find identical protein sequences (IPS). Because exact matches are resolved by hash lookup, known proteins skip computationally expensive homology searches, which speeds up annotation considerably. Bakta also integrates several expert annotation systems (e.g., AMRFinderPlus, VFDB) and predicts further features, including ncRNAs, CRISPRs, and short open reading frames (sORFs), aiming to balance speed and annotation depth.
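The hash-lookup idea can be illustrated with a minimal Python sketch. It is not Bakta's implementation: the function names, the in-memory dictionary standing in for the database, the sequence normalization, and the toy sequences are all assumptions made for illustration.

```python
import hashlib
from typing import Dict, List, Tuple


def md5_digest(protein_seq: str) -> str:
    """Return the hex MD5 digest of an amino-acid sequence.

    Normalization (trim whitespace, drop trailing stop '*', uppercase)
    is an assumption of this sketch, not a documented Bakta rule.
    """
    normalized = protein_seq.strip().rstrip("*").upper()
    return hashlib.md5(normalized.encode("ascii")).hexdigest()


def build_index(reference: Dict[str, str]) -> Dict[str, str]:
    """Map digest -> product description for each reference sequence."""
    return {md5_digest(seq): product for seq, product in reference.items()}


def identify(
    queries: Dict[str, str], index: Dict[str, str]
) -> Tuple[Dict[str, str], List[str]]:
    """Split queries into exact-match hits and unresolved IDs.

    Unresolved sequences are the ones that would still need a
    homology search in a real pipeline.
    """
    hits: Dict[str, str] = {}
    unresolved: List[str] = []
    for query_id, seq in queries.items():
        product = index.get(md5_digest(seq))
        if product is None:
            unresolved.append(query_id)
        else:
            hits[query_id] = product
    return hits, unresolved


# Toy data (hypothetical sequences and product names).
REFERENCE = {
    "MKTAYIAKQR": "toy protein A",
    "MSLNEQVLKA": "toy protein B",
}
QUERIES = {
    "cds_1": "mktayiakqr*",  # identical to toy protein A after normalization
    "cds_2": "MKTAYIAKQK",   # one residue differs, so no exact match
}

HITS, UNRESOLVED = identify(QUERIES, build_index(REFERENCE))
print(HITS)
print(UNRESOLVED)
```

A digest lookup only finds identical sequences: a single substitution, as in `cds_2`, yields a different digest, so such sequences fall through to slower search steps.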

Quick Start & Requirements

  • Install: conda install -c conda-forge -c bioconda bakta, or pull the container image with Docker or Podman (e.g., podman pull oschwengers/bakta).
  • Prerequisites: Depends on external tools, including tRNAscan-SE, Aragorn, Infernal, PILER-CR, Pyrodigal, PyHMMER, Diamond, BLAST+, and AMRFinderPlus.
  • Database: A database download is mandatory (light and full versions are available).
  • Resources: Annotating a typical bacterial genome takes ~10 minutes on a laptop. The full database requires ~84 GB unzipped.
  • Docs: https://bakta.computational.bio/
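As a hedged illustration of the steps above, the following Python sketch assembles the two command lines (database download, then annotation) without running them. The helper names and all paths are hypothetical; the options shown (`bakta_db download --output/--type`, `bakta --db/--output/--prefix/--threads`) reflect the Bakta CLI as commonly documented, but check `bakta --help` for your installed version before relying on them.

```python
import shlex
from typing import List, Optional


def bakta_db_download_cmd(db_dir: str, db_type: str = "light") -> List[str]:
    """Build the database download command (hypothetical helper)."""
    if db_type not in ("light", "full"):
        raise ValueError("db_type must be 'light' or 'full'")
    return ["bakta_db", "download", "--output", db_dir, "--type", db_type]


def bakta_annotate_cmd(
    genome_fasta: str,
    db_path: str,
    output_dir: str,
    prefix: Optional[str] = None,
    threads: Optional[int] = None,
) -> List[str]:
    """Build the annotation command (hypothetical helper)."""
    cmd = ["bakta", "--db", db_path, "--output", output_dir]
    if prefix:
        cmd += ["--prefix", prefix]
    if threads:
        cmd += ["--threads", str(threads)]
    cmd.append(genome_fasta)  # the input genome is the positional argument
    return cmd


# Placeholder paths; adjust to your environment.
DOWNLOAD = bakta_db_download_cmd("/data/bakta", "light")
ANNOTATE = bakta_annotate_cmd(
    "genome.fasta", "/data/bakta/db-light", "results", prefix="sample1", threads=8
)

# Print the shell-quoted commands instead of executing them.
print(shlex.join(DOWNLOAD))
print(shlex.join(ANNOTATE))
# To actually run one: subprocess.run(ANNOTATE, check=True)
```

Building the argument list separately from execution keeps the commands easy to log and inspect before committing to a multi-gigabyte download or a long annotation run.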

Highlighted Details

  • Fast annotation (~10 min/genome) via alignment-free sequence identification.
  • Comprehensive, taxonomy-independent database derived from UniProt (UniRef clusters).
  • Supports annotation of ncRNAs, sORFs, CRISPRs, and integrates expert systems like AMRFinderPlus.
  • Outputs FAIR-compliant GFF3, GenBank, EMBL, and machine-readable JSON formats.
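Because the outputs are machine-readable, downstream scripts can consume them directly. The sketch below counts feature types in GFF3 text using only the generic GFF3 layout (nine tab-separated columns, `#` comment lines, an optional `##FASTA` section). The function name and the example records are made up for illustration and are not real Bakta output.

```python
from collections import Counter
from typing import Iterable


def count_feature_types(gff3_lines: Iterable[str]) -> Counter:
    """Count GFF3 records per feature type (column 3)."""
    counts: Counter = Counter()
    for raw in gff3_lines:
        line = raw.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # everything after this directive is sequence data
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        columns = line.split("\t")
        if len(columns) != 9:
            continue  # ignore malformed records in this sketch
        counts[columns[2]] += 1
    return counts


# Made-up records in GFF3 layout (not real annotation output).
EXAMPLE = [
    "##gff-version 3",
    "contig_1\tExample\tCDS\t1\t300\t.\t+\t0\tID=toy_1;product=toy protein A",
    "contig_1\tExample\ttRNA\t400\t475\t.\t-\t.\tID=toy_2;product=tRNA-Ala",
    "contig_1\tExample\tCDS\t600\t900\t.\t+\t0\tID=toy_3;product=toy protein B",
    "##FASTA",
    ">contig_1",
    "ATGC",
]

COUNTS = count_feature_types(EXAMPLE)
print(dict(COUNTS))
```

For a real file, pass an open file handle instead of the list, for example `count_feature_types(open("results/sample1.gff3"))`, where the path is a placeholder.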

Maintenance & Community

The project is actively maintained, with contributions from multiple authors. Community interaction and feature requests are encouraged via the GitHub issues page.

Licensing & Compatibility

Bakta is distributed under the MIT license, allowing for commercial use and integration with closed-source software.

Limitations & Caveats

Bakta is designed specifically for bacterial genomes and plasmids; it does not support archaeal or eukaryotic genomes. sORF prediction applies strict criteria: only sORFs that are identified via IPS or protein sequence cluster (PSC) hits and that carry a gene symbol or product description are included. DeepSig, which is used for signal peptide prediction, is not available on macOS.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 1
  • Issues (30d): 1
  • Star History: 34 stars in the last 90 days
