proteina by NVIDIA-Digital-Bio

Flow-based protein generator for de novo design

Created 1 year ago
262 stars

Project Summary

Proteina is a large-scale, flow-based protein backbone generator, presented as an ICLR 2025 oral paper. It targets de novo protein design by generating diverse, designable backbones up to 800 residues long. Researchers can steer the output through hierarchical conditioning on CATH fold class labels, and the model is built on a scalable transformer architecture tailored to the task.

How It Works

Proteina is a flow-based generative model built on a scalable transformer architecture with substantially more parameters than prior work. Its core innovation is hierarchical conditioning on CATH fold class labels, which lets users guide generation at the secondary-structure level or request specific folds. Training and sampling also rely on LoRA fine-tuning, classifier-free guidance, and autoguidance. In addition, the work introduces new metrics for evaluating the distributional similarity of generated protein sets.
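The guidance idea mentioned above can be illustrated with a minimal NumPy sketch. This is a generic illustration of classifier-free guidance on a flow model, not Proteina's implementation: the function names (guided_velocity, euler_sample), the toy velocity fields, and the target points are all assumptions made for the example. Autoguidance is commonly described as the same extrapolation with a weaker model in place of the unconditional one, which this sketch does not model separately.

```python
import numpy as np


def guided_velocity(v_cond, v_uncond, w):
    """Classifier-free guidance: start from the unconditional velocity and
    extrapolate toward the conditional one with guidance weight w."""
    return v_uncond + w * (v_cond - v_uncond)


def euler_sample(velocity_fn, x0, n_steps=100):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt  # t stays strictly below 1, so (1 - t) is never zero
        x = x + dt * velocity_fn(x, t)
    return x


# Toy velocity fields (illustrative only): each one pulls x along a straight
# line toward a fixed target, as in a linear flow-matching path.
target_cond = np.array([1.0, 0.0])
target_uncond = np.array([0.0, 0.0])


def v_cond(x, t):
    return (target_cond - x) / (1.0 - t)


def v_uncond(x, t):
    return (target_uncond - x) / (1.0 - t)


def make_guided(w):
    """Build a guided velocity field for a given guidance weight."""
    return lambda x, t: guided_velocity(v_cond(x, t), v_uncond(x, t), w)


x_start = np.array([0.5, -0.5])
x_w0 = euler_sample(make_guided(0.0), x_start)  # purely unconditional
x_w1 = euler_sample(make_guided(1.0), x_start)  # purely conditional
x_w2 = euler_sample(make_guided(2.0), x_start)  # extrapolates past the conditional target
```

In this toy setup a weight of 1 recovers the conditional sample, while a larger weight pushes the sample further away from the unconditional target, which is the usual trade-off between adherence to the condition and diversity.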

Quick Start & Requirements

Setup requires mamba or conda. Create the environment from environment.yaml, activate it (conda activate proteina_env), and install the package in editable mode (pip install -e .). A .env file containing DATA_PATH=/directory/where/you/store/files is mandatory. Users must also download and unpack proteina_additional_files.zip and proteina_training_data_indices.zip and organize their contents under DATA_PATH as specified in the repository. Links to the paper and project page are available in the repository.
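Because a missing or mistyped DATA_PATH is an easy mistake to make, a small pre-flight check can help. The following is a minimal sketch, assuming a plain KEY=VALUE .env file; read_env_file and resolve_data_path are hypothetical helpers written for this example and are not part of the Proteina codebase, which may load the file differently.

```python
from pathlib import Path


def read_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in Path(path).read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_data_path(env_file=".env"):
    """Return DATA_PATH from the .env file as a Path, or raise if unusable."""
    data_path = read_env_file(env_file).get("DATA_PATH")
    if not data_path:
        raise KeyError(f"DATA_PATH is not set in {env_file}")
    path = Path(data_path).expanduser()
    if not path.is_dir():
        raise FileNotFoundError(f"DATA_PATH does not exist: {path}")
    return path
```

Calling resolve_data_path() from the directory that holds the .env file either returns the data directory or fails with a message naming the problem, before any download or training step is attempted.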

Highlighted Details

  • Achieves state-of-the-art performance in de novo protein backbone design.
  • Generates diverse, designable proteins up to 800 residues in length.
  • Hierarchical CATH fold class conditioning offers novel control over protein structure generation.
  • Introduces new distributional similarity metrics (FPSD, adapted from FID) for evaluating generated protein sets.
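Since FPSD is described as adapted from FID (Fréchet Inception Distance), the underlying computation is presumably a Fréchet distance between two Gaussians fitted to feature vectors. The sketch below shows that generic formula in NumPy. It is an assumption-laden illustration: the feature extractor Proteina uses is not described here, so the inputs are simply arrays of shape (n_samples, n_features), and frechet_distance is a name chosen for this example.

```python
import numpy as np


def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two feature sets:
    ||mu_a - mu_b||^2 + Tr(C_a) + Tr(C_b) - 2 Tr((C_a C_b)^(1/2))."""
    a = np.asarray(feats_a, dtype=float)
    b = np.asarray(feats_b, dtype=float)
    mu_a, mu_b = a.mean(axis=0), b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(b, rowvar=False))
    # The trace of the matrix square root of C_a C_b equals the sum of the
    # square roots of its eigenvalues, which are real and non-negative for
    # covariance matrices (clipping guards against round-off).
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)


# Illustrative random features standing in for embeddings of protein sets.
rng = np.random.default_rng(0)
reference = rng.normal(size=(500, 4))
shifted = reference + 3.0  # same covariance, mean moved by 3 in each dimension

d_same = frechet_distance(reference, reference)
d_shifted = frechet_distance(reference, shifted)
```

Identical sets give a distance of approximately zero, and a pure mean shift contributes only the squared distance between the means, so lower values indicate that the generated set is distributionally closer to the reference set.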

Maintenance & Community

No specific details regarding community channels (e.g., Discord, Slack), active maintainers beyond the author list, or roadmap are provided in the README.

Licensing & Compatibility

The source code, model weights, and auxiliary files are released under an NVIDIA license, strictly for non-commercial or research purposes. Commercial use or linking with closed-source applications is not permitted under this license.

Limitations & Caveats

The use of fold_class_mappings_C_selected_A_T_cath_codes.pth for generating long proteins (300-800 residues) with specific CATH codes is noted as experimental, as many fold types may not naturally occur at these lengths. Conditional generation for specific CATH codes is limited to plausible length-fold combinations observed in training data.
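To make the length-fold caveat concrete, the sketch below splits a CATH code into its hierarchical levels (Class, Architecture, Topology, Homologous superfamily) and checks a requested length against a table of observed length ranges. Both cath_levels and is_plausible are hypothetical helpers, and the table in the usage lines holds made-up illustrative values. The real contents and layout of the .pth mapping files are not described here and are not modeled.

```python
from typing import Dict, Tuple


def cath_levels(code: str) -> Dict[str, str]:
    """Split a CATH code such as "3.40.50" into cumulative level labels,
    e.g. {"C": "3", "A": "3.40", "T": "3.40.50"}."""
    parts = code.strip().split(".")
    if not 1 <= len(parts) <= 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"not a valid CATH code: {code!r}")
    names = ["C", "A", "T", "H"]
    return {names[i]: ".".join(parts[: i + 1]) for i in range(len(parts))}


def is_plausible(code: str, length: int, observed: Dict[str, Tuple[int, int]]) -> bool:
    """Return True if length lies within the observed (min, max) range of the
    most specific level of the code that appears in the table."""
    levels = cath_levels(code)
    for name in reversed(list(levels)):  # most specific level first
        label = levels[name]
        if label in observed:
            low, high = observed[label]
            return low <= length <= high
    return False  # no level of this code was observed at all


# Made-up ranges for illustration only, not taken from any training data.
toy_ranges = {"3.40": (50, 300)}
ok_short = is_plausible("3.40.50", 120, toy_ranges)
ok_long = is_plausible("3.40.50", 700, toy_ranges)
ok_unknown = is_plausible("1.10", 120, toy_ranges)
```

A check of this kind would reject requests such as a 700-residue protein for a fold only observed at much shorter lengths, which is the situation the experimental long-protein mapping file is flagged for.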

Health Check

  • Last commit: 10 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 11 stars in the last 30 days
