Clair3 by HKU-BAL

Deep learning variant caller for long-read sequencing

Created 5 years ago

378 stars

Top 74.9% on SourcePulse

Project Summary

Clair3 is a deep learning-based caller of germline small variants for long-read sequencing data. It combines two calling strategies: a fast pileup-based model that identifies most variant candidates, and a slower, more precise full-alignment model reserved for complex cases. The project reports that this two-model design outperforms earlier callers, particularly at lower sequencing coverage, which makes it useful for researchers and bioinformaticians who need both high recall and high precision.

How It Works

Clair3 integrates two deep learning models: a pileup model and a full-alignment model. The pileup model works from summarized alignment statistics and quickly identifies the majority of variant candidates. Candidates that need higher confidence or are more complex are passed to a haplotype-resolved full-alignment model, which is more computationally intensive. Running the expensive model only where it is needed balances speed against precision and recall, and the project reports substantially fewer errors than its predecessor, Clair.
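The two-stage flow described above can be illustrated with a small routing sketch. It is not Clair3's actual selection logic: the `Candidate` class, the `pileup_confidence` score, and the 0.9 cutoff are all hypothetical names and values chosen only to show the idea of accepting confident pileup calls and escalating the rest.

```python
from dataclasses import dataclass

# Hypothetical cutoff: candidates scored below this by the pileup pass
# are escalated to the full-alignment pass. Not a Clair3 default.
PILEUP_CONFIDENCE_CUTOFF = 0.9


@dataclass
class Candidate:
    chrom: str
    pos: int
    pileup_confidence: float  # illustrative score in [0, 1]


def route_candidates(candidates, cutoff=PILEUP_CONFIDENCE_CUTOFF):
    """Split candidates into (accepted from pileup, escalated to full alignment)."""
    accepted, escalated = [], []
    for cand in candidates:
        if cand.pileup_confidence >= cutoff:
            accepted.append(cand)
        else:
            escalated.append(cand)
    return accepted, escalated


if __name__ == "__main__":
    demo = [
        Candidate("chr1", 10_000, 0.99),
        Candidate("chr1", 10_250, 0.55),
        Candidate("chr2", 5_000, 0.93),
    ]
    accepted, escalated = route_candidates(demo)
    print(f"pileup-only: {len(accepted)}, full-alignment: {len(escalated)}")
```

The design point is that the expensive model only sees the escalated subset, so its cost scales with the number of hard candidates rather than with the whole genome.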

Quick Start & Requirements

Install via Mamba/Conda (recommended, and required for GPU or Apple Silicon acceleration), Docker (CPU only), or Singularity (CPU only). Key dependencies are Python 3.11, Samtools (>=1.10), and PyTorch, plus CUDA if GPU acceleration is wanted. Pre-trained PyTorch models for ONT, PacBio HiFi, and Illumina data are available for download. GPU acceleration gives roughly a 5x speedup over CPU.
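Once installed, a run is typically launched through the `run_clair3.sh` wrapper. The sketch below only assembles the command line and does not execute anything. The flag names and the platform values (`ont`, `hifi`, `ilmn`) are recalled from the upstream README for earlier releases and are an assumption here, so they should be checked against the usage text of the installed version. All file paths are placeholders.

```python
import shlex

# Assumed platform identifiers; verify against the installed version.
KNOWN_PLATFORMS = {"ont", "hifi", "ilmn"}


def build_clair3_command(bam, ref, model_dir, out_dir, platform="ont", threads=8):
    """Return the argument list for a run_clair3.sh invocation (not executed)."""
    if platform not in KNOWN_PLATFORMS:
        raise ValueError(f"unknown platform: {platform!r}")
    if threads < 1:
        raise ValueError("threads must be a positive integer")
    return [
        "run_clair3.sh",
        f"--bam_fn={bam}",
        f"--ref_fn={ref}",
        f"--threads={threads}",
        f"--platform={platform}",
        f"--model_path={model_dir}",
        f"--output={out_dir}",
    ]


if __name__ == "__main__":
    cmd = build_clair3_command(
        bam="sample.bam",            # placeholder input alignment
        ref="reference.fa",          # placeholder reference FASTA
        model_dir="models/ont",      # placeholder model directory
        out_dir="clair3_output",     # placeholder output directory
        threads=36,
    )
    # Print a shell-safe form for inspection before running it by hand.
    print(shlex.join(cmd))
```

Building the command as a list keeps it safe to pass to `subprocess.run` later without shell quoting issues.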

Highlighted Details

v2.0.0 (Feb 2026): Migrated deep learning framework from TensorFlow to PyTorch. Introduced signal-aware variant calling using Nanopore dwell time features (requires Dorado mv tags).
Performance: Achieved 99.69% SNP F1-score and 80.58% Indel F1-score on HG003 85x ONT data, reducing SNP errors by ~78% and Indel errors by ~48% compared to Clair.
Efficiency: Processes 50x WGS ONT data in ~8 hours on 36 CPU cores (~4x faster than PEPPER, ~14x faster than Medaka). Processes 35x WGS PacBio HiFi data in ~2 hours (13x faster than DeepVariant).
Features: Supports GVCF output, offers multiple phasing options (WhatsHap, LongPhase), and includes specialized modes for amplicon data and haploid calling.
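For readers less familiar with the metrics quoted in the Performance item, the sketch below shows how an F1-score and a relative error reduction are computed. It assumes that "errors" means false positives plus false negatives, which the summary does not state, and the counts in the example are invented purely for illustration. They are not Clair3 benchmark data.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall from raw call counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def relative_error_reduction(baseline_errors, new_errors):
    """Fraction of baseline errors removed by the new caller."""
    if baseline_errors <= 0:
        raise ValueError("baseline_errors must be positive")
    return (baseline_errors - new_errors) / baseline_errors


if __name__ == "__main__":
    # Invented counts, for illustration only.
    tp, fp, fn = 9_950, 20, 30
    print(f"F1 = {f1_score(tp, fp, fn):.4f}")
    # A caller that makes 22 errors where a baseline made 100.
    print(f"error reduction = {relative_error_reduction(100, 22):.0%}")
```

Because F1 is already close to 1 for SNPs, the error reduction figure is often the more informative of the two when comparing callers.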

Maintenance & Community

The project lists contact emails for Ruibang Luo, Zhenxian Zheng, and Xian Yu. Recent updates credit several contributors (e.g., @Devon Ryan, @Sam Nicholls, @William Shropshire) with bug fixes and feature enhancements, which suggests active development and community involvement. No community channels such as Slack or Discord are listed.

Licensing & Compatibility

The license type is not explicitly stated in the provided README content. Users considering commercial use or integration into closed-source projects should confirm the license in the repository before proceeding.

Limitations & Caveats

Docker and Singularity images are limited to CPU execution; GPU or Apple Silicon acceleration requires a Mamba/Conda installation. TensorFlow models from Clair3 v1 are incompatible with the PyTorch-based v2.0. The --enable_variant_calling_at_sequence_head_and_tail option is useful for amplicon data but should be used cautiously, because alignments at the heads and tails of sequences can be less reliable.
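The constraints above can be summarized as a small decision helper. This is only a sketch of the rules as stated in this summary. The function names and the string labels are hypothetical and are not part of Clair3.

```python
CPU_ONLY_METHODS = {"docker", "singularity"}


def pick_install_method(needs_gpu=False, needs_apple_silicon=False, prefer="docker"):
    """Choose an install route under the constraints listed above."""
    # Container images are CPU only, so any acceleration forces Mamba/Conda.
    if needs_gpu or needs_apple_silicon:
        return "mamba/conda"
    if prefer in CPU_ONLY_METHODS or prefer == "mamba/conda":
        return prefer
    raise ValueError(f"unknown install method: {prefer!r}")


def model_usable_with_v2(model_framework):
    """v2.0 is PyTorch based, so TensorFlow models from v1 cannot be loaded."""
    return model_framework.lower() == "pytorch"


if __name__ == "__main__":
    print(pick_install_method(needs_gpu=True))
    print(pick_install_method(prefer="singularity"))
    print(model_usable_with_v2("TensorFlow"))
```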

Explore Similar Projects

OmniGenBench by COLA-Laboratory

TriForce by Infini-AI-Lab

scikit-fingerprints by MLCIL

ChatLearn by alibaba

bionemo-agent-toolkit by NVIDIA-BioNeMo

awesome-bioinformatics-benchmarks by j-andrews7

bakta by oschwengers

UniRep by churchlab

bionemo-recipes by NVIDIA-BioNeMo

evo by evo-design

deeplearning-biology by hussius

gatk by broadinstitute