Clair3 by HKU-BAL

Deep learning variant caller for long-read sequencing

Created 5 years ago
345 stars

Top 80.4% on SourcePulse

Summary

Clair3 is a deep learning-based caller for germline small variants in long-read sequencing data. It combines two calling strategies: a fast pileup-based model that handles the bulk of candidate identification, and a slower, more precise full-alignment model for complex cases. The project reports that this two-model design performs especially well at lower sequencing coverage, which makes it useful to researchers and bioinformaticians who need both high recall and high precision.

How It Works

Clair3 integrates two deep learning models: a pileup model and a full-alignment model. The pileup model works on summarized alignment statistics and efficiently identifies the majority of variant candidates. Candidates that are complex or need higher confidence are passed to a haplotype-resolved full-alignment model, which is more computationally intensive. This split keeps most of the work on the fast path while reserving the expensive model for the cases that benefit from it, and the project reports fewer SNP and Indel errors than the earlier Clair caller.
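The routing idea described above can be sketched in a few lines of Python. This is only an illustration, not Clair3's code: the `Candidate` class, the `pileup_quality` field, the `0.9` threshold, and the `full_alignment_model` callback are all hypothetical names and values chosen for the example, and the real selection criteria are not given in this summary.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Candidate:
    position: int          # hypothetical reference coordinate
    pileup_quality: float  # hypothetical pileup-model confidence in [0, 1]


def route_candidates(
    candidates: List[Candidate], threshold: float = 0.9
) -> Tuple[List[Candidate], List[Candidate]]:
    """Split candidates into (accepted from pileup, deferred to full alignment)."""
    accepted = [c for c in candidates if c.pileup_quality >= threshold]
    deferred = [c for c in candidates if c.pileup_quality < threshold]
    return accepted, deferred


def call_variants(
    candidates: List[Candidate],
    full_alignment_model: Callable[[Candidate], bool],
    threshold: float = 0.9,
) -> Dict[int, str]:
    """Return {position: which model made the call} for every accepted variant."""
    accepted, deferred = route_candidates(candidates, threshold)
    calls = {c.position: "pileup" for c in accepted}
    # Only the low-confidence candidates pay for the expensive model.
    for c in deferred:
        if full_alignment_model(c):
            calls[c.position] = "full_alignment"
    return calls
```

With four candidates of which two fall below the threshold, only those two reach the slower model, which is the source of the speed advantage the section describes.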

Quick Start & Requirements

Install via Mamba/Conda (recommended, and the only route to GPU or Apple Silicon acceleration), Docker (CPU only), or Singularity (CPU only). Key dependencies are Python 3.11, Samtools (>=1.10), and PyTorch, plus CUDA when GPU acceleration is wanted. Pre-trained PyTorch models for several platforms (ONT, PacBio HiFi, Illumina) are available for download. GPU acceleration gives roughly a 5x speedup over CPU.
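Before installing, it can help to confirm that the Samtools on the PATH meets the >=1.10 requirement. The sketch below is a hypothetical helper, not part of Clair3; it assumes `samtools --version` prints a first line of the form `samtools 1.17`. Versions are compared as integer tuples because a plain string comparison would rank "1.9" above "1.10".

```python
import re
import shutil
import subprocess
from typing import Optional, Tuple

MIN_SAMTOOLS = (1, 10)  # minimum version stated in the requirements above


def parse_samtools_version(banner: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from the text printed by `samtools --version`."""
    match = re.search(r"samtools\s+(\d+)\.(\d+)", banner)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def samtools_is_recent_enough(
    banner: str, minimum: Tuple[int, int] = MIN_SAMTOOLS
) -> bool:
    """True when the banner reports a version at or above the minimum."""
    version = parse_samtools_version(banner)
    return version is not None and version >= minimum


def installed_samtools_banner() -> Optional[str]:
    """Return the version banner of the samtools on PATH, or None if absent."""
    if shutil.which("samtools") is None:
        return None
    result = subprocess.run(
        ["samtools", "--version"], capture_output=True, text=True, check=False
    )
    return result.stdout
```

A typical use is `samtools_is_recent_enough(installed_samtools_banner() or "")`, which returns False both when Samtools is missing and when it is too old.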

Highlighted Details

  • v2.0.0 (Feb 2026): Migrated deep learning framework from TensorFlow to PyTorch. Introduced signal-aware variant calling using Nanopore dwell time features (requires Dorado mv tags).
  • Performance: Achieved 99.69% SNP F1-score and 80.58% Indel F1-score on HG003 85x ONT data, reducing SNP errors by ~78% and Indel errors by ~48% compared to Clair.
  • Efficiency: Processes 50x WGS ONT data in ~8 hours on 36 CPU cores (~4x faster than PEPPER, ~14x faster than Medaka). Processes 35x WGS PacBio HiFi data in ~2 hours (13x faster than DeepVariant).
  • Features: Supports GVCF output, offers multiple phasing options (WhatsHap, LongPhase), and includes specialized modes for amplicon data and haploid calling.
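The performance and efficiency figures above rest on simple arithmetic that is easy to reproduce. The following sketch shows the standard F1 formula, a relative error reduction, and the baseline runtime implied by an "N times faster" claim. The function names are hypothetical, the counts in the usage note are illustrative rather than taken from any benchmark, and "errors" is assumed here to mean false positives plus false negatives.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Standard F1: harmonic mean of precision and recall."""
    return 2 * tp / (2 * tp + fp + fn)


def relative_error_reduction(baseline_errors: float, new_errors: float) -> float:
    """Fraction of the baseline's errors that the new caller removes."""
    return 1.0 - new_errors / baseline_errors


def implied_baseline_hours(hours: float, speedup: float) -> float:
    """Runtime of the slower tool if the faster one is `speedup` times faster."""
    return hours * speedup
```

For example, a caller that makes 22 errors where a baseline makes 100 has a relative error reduction of 0.78. Reading the efficiency bullet the same way, `implied_baseline_hours(8, 4)` gives 32 hours and `implied_baseline_hours(8, 14)` gives 112 hours for the tools compared on the 50x ONT data, and `implied_baseline_hours(2, 13)` gives 26 hours for the PacBio HiFi comparison.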

Maintenance & Community

The project lists contact emails for Ruibang Luo, Zhenxian Zheng, and Xian Yu. Several recent updates highlight contributions from various individuals (e.g., @Devon Ryan, @Sam Nicholls, @William Shropshire), indicating active development and community involvement in bug fixes and feature enhancements. No specific community channels (like Slack or Discord) are listed.

Licensing & Compatibility

The README content summarized here does not explicitly state a license. Check the repository's license terms before commercial use or integration into closed-source projects.

Limitations & Caveats

Docker and Singularity images are limited to CPU execution; GPU or Apple Silicon acceleration necessitates a Mamba/Conda installation. TensorFlow models from Clair3 v1 are incompatible with the PyTorch-based v2.0. The --enable_variant_calling_at_sequence_head_and_tail option, while useful for amplicon data, should be used cautiously due to potentially less reliable alignments in those regions.

Health Check

Last Commit: 2 weeks ago
Responsiveness: Inactive
Pull Requests (30d): 1
Issues (30d): 3
Star History: 3 stars in the last 30 days
