EDTA by oushujun

CLI tool for de-novo transposable element (TE) annotation and benchmarking

Created 6 years ago

430 stars

Top 69.0% on SourcePulse

Project Summary

EDTA (Extensive de novo TE Annotator) is a pipeline for automated, de novo transposable element (TE) annotation of whole genomes. It is aimed at researchers and bioinformaticians who need to generate high-quality, non-redundant TE libraries and accurate genome-wide TE annotations. It also provides benchmarking tools for evaluating new TE libraries and methods.

How It Works

EDTA integrates multiple TE detection tools (e.g., RepeatModeler, LTR_FINDER, HelitronScanner) and employs a multi-step process to filter raw TE candidates, reduce redundancy, and classify TEs. It leverages curated TE libraries and optional CDS information to refine annotations, minimize false positives, and improve the accuracy of TE identification, particularly for under-annotated TE types. The pipeline can also generate masked genome files that exclude TEs from gene annotation regions to improve gene prediction quality.
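
To make the idea of redundancy reduction concrete, here is a toy sketch that drops candidate sequences fully contained in a longer candidate on either strand. This is not EDTA's actual filtering logic, which the summary above does not detail; the function names and the containment rule are assumptions chosen purely for illustration.

```python
from __future__ import annotations


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase or lowercase DNA string."""
    comp = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(comp)[::-1]


def drop_contained(candidates: dict[str, str]) -> dict[str, str]:
    """Toy redundancy filter (hypothetical, not EDTA's method).

    Keeps a candidate only if neither it nor its reverse complement is an
    exact substring of a longer (or equal-length, earlier) kept candidate.
    """
    kept: dict[str, str] = {}
    # Visit the longest sequences first so shorter ones are compared to them.
    ordered = sorted(candidates.items(), key=lambda kv: len(kv[1]), reverse=True)
    for name, seq in ordered:
        fwd = seq.upper()
        rev = reverse_complement(fwd)
        if any(fwd in k or rev in k for k in kept.values()):
            continue  # redundant: already represented by a kept sequence
        kept[name] = fwd
    return kept


if __name__ == "__main__":
    raw = {"a": "ACGTTGCAAC", "b": "GTTGC", "e": "CAACG", "d": "TTTTT"}
    print(sorted(drop_contained(raw)))
```

Real pipelines typically rely on alignment-based similarity rather than exact substring matches; exact containment is used here only because it is easy to verify by eye.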

Quick Start & Requirements

Installation: Recommended via conda/mamba (mamba install -c conda-forge -c bioconda edta) or Singularity/Docker for HPC/macOS users. A yml file installation is also available.
Prerequisites: Perl, samtools, bedtools, RepeatMasker, RepeatModeler, and various Perl/R packages. Specific dependencies are listed in the README.
Usage: perl EDTA.pl --genome genome.fa [options]
Resources: The pipeline is disk-I/O intensive; running it on fast storage (SSD) is recommended.
Documentation: Wiki page for more information and FAQs.
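
As a minimal sketch of the usage line above, the following helper assembles the `perl EDTA.pl --genome genome.fa [options]` command as an argument list. Only `--genome` comes from the text above; the `--cds` and `--threads` options, and the helper name `build_edta_command`, are assumptions here and should be checked against the tool's own help output before use.

```python
import shlex


def build_edta_command(genome, cds=None, threads=None):
    """Build an EDTA command line as a list of arguments (hypothetical helper).

    genome  : path to the genome FASTA (the only required input shown above)
    cds     : optional CDS FASTA; the --cds flag is an assumption
    threads : optional thread count; the --threads flag is an assumption
    """
    cmd = ["perl", "EDTA.pl", "--genome", str(genome)]
    if cds is not None:
        cmd += ["--cds", str(cds)]
    if threads is not None:
        cmd += ["--threads", str(threads)]
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(build_edta_command("genome.fa")))
```

Passing the returned list to `subprocess.run(cmd, check=True)` would launch the run without shell quoting problems, assuming `EDTA.pl` is in the working directory.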

Highlighted Details

Automated de novo TE annotation and library generation.
Benchmarking tools (lib-test.pl) for comparing TE annotation performance.
panEDTA functionality for pan-genome TE analysis.
Optional CDS input to filter gene-related sequences and improve TE library quality.

Maintenance & Community

The project is actively developed by the Ou lab at Ohio State University, Deng's Bioinformatics Engineering Team, and Joseph Guhlin's lab. Community support is available through GitHub Issues.

Licensing & Compatibility

The README does not explicitly state a license. Dependencies such as RepeatMasker and RepeatModeler carry their own license terms, which may need to be reviewed before commercial use.

Limitations & Caveats

Sequence names must be short (13 characters or fewer) and simple. Docker usage has specific path limitations. The pipeline can be resource-intensive, particularly the RepeatMasker and RepeatModeler steps. The README does not specify a license, which could affect commercial adoption.
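
The sequence-name limit can be checked before starting a long run. The sketch below scans FASTA header lines and reports names that are too long or not "simple". The summary does not define "simple", so the rule used here (letters, digits, and underscores only) is an assumption, as are the helper name and the choice to treat the whole header line as the name.

```python
import re

MAX_NAME_LEN = 13  # limit stated above
# Assumption: "simple" means only letters, digits, and underscores.
SIMPLE_NAME = re.compile(r"^[A-Za-z0-9_]+$")


def check_sequence_names(fasta_lines):
    """Return (name, reason) pairs for FASTA headers that break the rules."""
    problems = []
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # skip sequence lines
        name = line[1:].strip()  # whole header is treated as the name
        if len(name) > MAX_NAME_LEN:
            problems.append((name, "longer than 13 characters"))
        elif not SIMPLE_NAME.match(name):
            problems.append((name, "contains characters other than A-Z, a-z, 0-9, _"))
    return problems


if __name__ == "__main__":
    import sys

    # Usage: python check_names.py genome.fa
    with open(sys.argv[1]) as handle:
        for name, reason in check_sequence_names(handle):
            print(f"{name}\t{reason}")
```

Reading the file line by line keeps memory use flat even for large genomes, since only header lines are inspected.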

