HapHiC by zengxiaofei

Fast, reference-independent genome scaffolding using Hi-C data

Created 2 years ago
254 stars

Project Summary

HapHiC is a fast, reference-independent, allele-aware scaffolding tool that uses Hi-C data to build chromosome-scale pseudomolecules from haplotype-phased genome assemblies. It needs no reference genome and tolerates low sequencing depth and assembly errors better than existing methods, which makes it suitable for complex diploid and allopolyploid genomes.

How It Works

HapHiC works in stages. It first clusters contigs with the Markov Cluster (MCL) algorithm, which controls the number of clusters implicitly and handles multi-chromosomal groups more effectively than traditional agglomerative hierarchical clustering (AHC). It then orders and orients contigs rapidly with an optimized version of the 3D-DNA iterative scaffolding algorithm and refines the result with ALLHiC. Together, these steps give efficient, allele-aware scaffolding without a reference genome and improve contig assignment, ordering, and orientation.
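
To make the clustering stage more concrete, here is a toy sketch of the general MCL procedure (alternating expansion and inflation on a column-stochastic matrix) in plain Python. It is not HapHiC's implementation: the function names, the default parameters, the self-loop weight, and the six-contig contact matrix are all invented for illustration.

```python
def normalize_columns(matrix):
    """Scale each column so that its entries sum to 1."""
    n = len(matrix)
    sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [[matrix[i][j] / sums[j] for j in range(n)] for i in range(n)]


def mcl_clusters(weights, inflation=2.0, max_iter=50, tol=1e-9):
    """Cluster nodes of a small weighted graph with a bare-bones MCL loop."""
    n = len(weights)
    # Add self-loops (weight 1.0 is an arbitrary choice for this sketch).
    m = normalize_columns(
        [[float(weights[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    )
    for _ in range(max_iter):
        # Expansion: square the matrix, i.e. take two random-walk steps.
        expanded = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
                    for i in range(n)]
        # Inflation: raise entries to a power and renormalize, which
        # strengthens strong links and weakens weak ones.
        inflated = normalize_columns(
            [[value ** inflation for value in row] for row in expanded]
        )
        delta = max(abs(inflated[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = inflated
        if delta < tol:
            break
    # Rows that still carry mass are attractors; the columns they attract
    # form one cluster. The number of clusters is never given as input.
    clusters = {frozenset(j for j, v in enumerate(row) if v > 1e-6) for row in m}
    clusters.discard(frozenset())
    return sorted(sorted(cluster) for cluster in clusters)


# Hypothetical contact weights: contigs 0-2 and 3-5 are tightly linked,
# with one weak link between contig 2 and contig 3.
contacts = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0.1, 0, 0],
    [0, 0, 0.1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
print(mcl_clusters(contacts))
```

The point of the sketch is that the inflation parameter, rather than a requested cluster count, determines how finely the graph is split, which is what "implicitly controls cluster numbers" refers to.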

Quick Start & Requirements

  • Installation: Recommended via Conda (conda env create -f HapHiC/conda_env/environment_py310.yml).
  • Prerequisites: A Linux environment with Intel Xeon, AMD EPYC, or Hygon C86 CPUs, and a Conda environment with Python 3.10-3.12.
  • Resource Footprint: Varies significantly by genome size; examples range from ~13 min/2 GiB RAM for human genomes to ~7 hours/135 GiB RAM for large plant genomes.
  • Key Links: GitHub: https://github.com/zengxiaofei/HapHiC.git.
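
The prerequisites and the install command above can be checked before installing with a small pre-flight script. This is only a sketch under stated assumptions: the helper names preflight and conda_create_command are hypothetical, the repository is assumed to be cloned into a directory named HapHiC, and only the environment_py310.yml file named above is used.

```python
import platform
import sys
from pathlib import PurePosixPath

# Python versions listed in the prerequisites above.
SUPPORTED_PYTHON = {(3, 10), (3, 11), (3, 12)}


def preflight(system, python_version):
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    if system != "Linux":
        problems.append(f"expected Linux, found {system}")
    if tuple(python_version[:2]) not in SUPPORTED_PYTHON:
        major, minor = python_version[:2]
        problems.append(f"expected Python 3.10-3.12, found {major}.{minor}")
    return problems


def conda_create_command(repo_dir="HapHiC"):
    """Build the Conda command from the Quick Start as an argument list."""
    env_file = PurePosixPath(repo_dir) / "conda_env" / "environment_py310.yml"
    return ["conda", "env", "create", "-f", str(env_file)]


if __name__ == "__main__":
    for problem in preflight(platform.system(), sys.version_info):
        print("warning:", problem)
    print(" ".join(conda_create_command()))
```

The script only prints the command; passing the list to subprocess.run would execute it, provided conda is on the PATH.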

Highlighted Details

  • Reference-independent, allele-aware scaffolding for haplotype-phased assemblies.
  • Outperforms alternatives in tolerance to low contig N50, low Hi-C depth, and assembly errors.
  • "Super-fast" and memory-efficient; capable of scaffolding most genomes within an hour on 8 cores.
  • Provides a "quick view" mode for cases where the chromosome number is unknown and for manual curation; contigs can be ordered and oriented without knowing the chromosome number in advance.

Maintenance & Community

HapHiC is actively maintained with regular updates, including version 1.0.7 (March 2025) and several releases in 2024 addressing stability and feature enhancements. Issues can be reported via GitHub Issues. No community forums (e.g., Discord/Slack) are linked.

Licensing & Compatibility

  • License: No explicit software license is stated in the README.
  • Compatibility: Primarily Linux. Compatible with haplotype-phased assemblies (e.g., from hifiasm) and can scaffold collapsed diploid/allopolyploid assemblies.

Limitations & Caveats

The Bioconda version is not officially maintained and has known issues. Support for contigs longer than 2^31-1 bp (introduced in v1.0.7) may be limited by upstream/downstream tool compatibility. Automatic parameter tuning in the one-line command may require manual intervention, with quick view mode serving as an alternative. The lack of an explicit license may restrict commercial use or integration into proprietary software.
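
The 2^31-1 bp threshold is the largest value a signed 32-bit integer can hold (2,147,483,647). As a rough way to see whether an assembly is affected, the sketch below scans the text of a FASTA index (the .fai format written by samtools faidx, whose first two tab-separated columns are sequence name and length) and lists contigs above the threshold. The function name and the two example records are made up for illustration.

```python
INT32_MAX = 2**31 - 1  # 2,147,483,647, the largest signed 32-bit integer


def oversized_contigs(fai_text):
    """Return (name, length) pairs for contigs longer than INT32_MAX bp."""
    flagged = []
    for line in fai_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # .fai columns: name, length, offset, linebases, linewidth
        name, length = line.split("\t")[:2]
        if int(length) > INT32_MAX:
            flagged.append((name, int(length)))
    return flagged


# Hypothetical index with one very long contig and one ordinary contig.
example_fai = (
    "ctg1\t2500000000\t6\t60\t61\n"
    "ctg2\t1200000\t2541666679\t60\t61\n"
)
print(oversized_contigs(example_fai))
```

Any contig this reports is one for which other tools in the workflow should be checked for compatibility before scaffolding.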

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 4
  • Star History: 4 stars in the last 30 days
