EDTA  by oushujun

CLI tool for de-novo transposable element (TE) annotation and benchmarking

created 6 years ago
394 stars

Top 74.2% on sourcepulse

Project Summary

EDTA (Extensive de novo TE Annotator) is a comprehensive pipeline for automated, de novo transposable element (TE) annotation across whole genomes. It is designed for researchers and bioinformaticians needing to generate high-quality, non-redundant TE libraries and perform accurate genome-wide TE annotations, offering benchmarking capabilities for new TE libraries and methods.

How It Works

EDTA integrates multiple TE detection tools (e.g., RepeatModeler, LTR_FINDER, HelitronScanner) and runs their raw candidates through a multi-step process that filters false positives, reduces redundancy, and classifies the remaining TEs. A curated TE library and coding sequences (CDS) can optionally be supplied to refine the annotation, which helps most for TE types that are otherwise under-annotated. The pipeline can also produce a TE-masked genome file, so that TE sequences do not interfere with downstream gene prediction.
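To make the idea of a masked genome concrete, here is a toy sketch of hard-masking: bases that fall inside annotated TE intervals are replaced with N. This is an illustration only and is not taken from EDTA's code; the function name `hard_mask`, the 0-based half-open coordinates, and the example intervals are all assumptions made for this sketch.

```python
def hard_mask(sequence: str, intervals: list[tuple[int, int]], mask_char: str = "N") -> str:
    """Replace bases inside 0-based, half-open [start, end) intervals with mask_char."""
    masked = list(sequence)
    for start, end in intervals:
        # Clamp to the sequence bounds so out-of-range intervals are harmless.
        start = max(0, start)
        end = min(len(masked), end)
        for i in range(start, end):
            masked[i] = mask_char
    return "".join(masked)


genome = "ACGTACGTACGT"
te_intervals = [(2, 5), (8, 10)]  # made-up TE coordinates for illustration
print(hard_mask(genome, te_intervals))  # prints ACNNNCGTNNGT
```

A gene predictor run on the masked sequence then skips the N stretches instead of mistaking TE-derived open reading frames for genes.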

Quick Start & Requirements

  • Installation: Recommended via conda/mamba (mamba install -c conda-forge -c bioconda edta), or via Singularity/Docker for HPC/macOS users. Installation from a yml environment file is also available.
  • Prerequisites: Perl, samtools, bedtools, RepeatMasker, RepeatModeler, and various Perl/R packages. Specific dependencies are listed in the README.
  • Usage: perl EDTA.pl --genome genome.fa [options]
  • Resources: The pipeline is disk I/O intensive; running it on a fast drive (SSD) is recommended.
  • Documentation: The project wiki provides more information and FAQs.
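The usage line above can be wrapped in a small script when EDTA is run from a workflow. The sketch below only assembles the argument list and prints it; it does not launch the pipeline. The helper `build_edta_command` is hypothetical, and the option names `--cds`, `--anno`, and `--threads` reflect my understanding of EDTA.pl's options, so they should be checked against the tool's own help output before use.

```python
import shlex
from typing import Optional


def build_edta_command(
    genome: str,
    cds: Optional[str] = None,
    threads: Optional[int] = None,
    annotate: bool = False,
) -> list[str]:
    """Assemble an EDTA.pl argument list (hypothetical helper, nothing is executed)."""
    cmd = ["perl", "EDTA.pl", "--genome", genome]
    if cds is not None:
        # Optional coding sequences used to filter gene-related sequences.
        cmd += ["--cds", cds]
    if annotate:
        # Assumed flag for whole-genome annotation after library construction.
        cmd += ["--anno", "1"]
    if threads is not None:
        cmd += ["--threads", str(threads)]
    return cmd


cmd = build_edta_command("genome.fa", cds="genome.cds.fa", threads=10, annotate=True)
print(shlex.join(cmd))
```

Building the command as a list rather than a single string means it can be handed to `subprocess.run(cmd, check=True)` without shell quoting problems.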

Highlighted Details

  • Automated de novo TE annotation and library generation.
  • Benchmarking tools (lib-test.pl) for comparing TE annotation performance.
  • panEDTA functionality for pan-genome TE analysis.
  • Optional CDS input to filter gene-related sequences and improve TE library quality.
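Benchmarking a TE library usually comes down to comparing a test annotation against a reference annotation and counting agreeing and disagreeing base pairs. The sketch below computes the standard metrics from such counts. It is a generic illustration, not a description of what lib-test.pl reports; the function name and the example counts are made up.

```python
def annotation_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Standard classification metrics from base-pair counts.

    tp: bases annotated as TE in both test and reference
    fp: bases annotated as TE only in the test annotation
    tn: bases annotated as non-TE in both
    fn: bases annotated as TE only in the reference
    """
    def ratio(numerator: int, denominator: int) -> float:
        # Return 0.0 instead of raising when a category is empty.
        return numerator / denominator if denominator else 0.0

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


# Made-up counts for a 1,000 bp toy genome.
for name, value in annotation_metrics(tp=80, fp=20, tn=880, fn=20).items():
    print(f"{name}: {value:.3f}")
```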

Maintenance & Community

The project is actively developed by the Ou lab at Ohio State University, Deng's Bioinformatics Engineering Team, and Joseph Guhlin's lab. Community support is available via GitHub Issues.

Licensing & Compatibility

The README does not explicitly state a license. Dependencies such as RepeatMasker and RepeatModeler carry their own license terms, which should be reviewed before any commercial use.

Limitations & Caveats

Sequence names in the input genome must be short (<=13 characters) and simple. Docker usage has specific path limitations. The pipeline can be resource-intensive, particularly the RepeatMasker and RepeatModeler steps. Because the README does not specify a license, commercial adoption may require extra clarification.
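Since the sequence-name limit is easy to trip over, a quick pre-flight check on the FASTA headers can save a failed run. The sketch below flags headers that exceed 13 characters or that are not "simple". What counts as simple is an assumption here (letters, digits, and underscores only, applied to the whole header text after ">"), and the function name is hypothetical.

```python
import re
from typing import Iterable

MAX_NAME_LENGTH = 13  # limit stated above
# Assumption: "simple" means letters, digits, and underscores only.
SIMPLE_NAME = re.compile(r"[A-Za-z0-9_]+")


def problem_sequence_names(fasta_lines: Iterable[str]) -> list[str]:
    """Return header names that are too long or contain other characters."""
    problems = []
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        name = line[1:].strip()
        if len(name) > MAX_NAME_LENGTH or not SIMPLE_NAME.fullmatch(name):
            problems.append(name)
    return problems


example = [
    ">Chr1", "ACGT",
    ">scaffold_000123456", "ACGT",  # 18 characters, too long
    ">chr2 description", "AC",      # contains a space
]
print(problem_sequence_names(example))
```

For a real genome, pass an open file handle instead of the example list, for instance `problem_sequence_names(open("genome.fa"))`.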

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 week
  • Pull requests (30d): 1
  • Issues (30d): 8
  • Star history: 15 stars in the last 90 days
