NVIDIA-Digital-Bio
Flow-based protein generator for de novo design
Proteina is a large-scale, flow-based protein backbone generator, presented as an ICLR 2025 Oral paper. It targets de novo protein design: it generates diverse, designable backbone structures, including long chains of several hundred residues, and gives researchers fine-grained control through hierarchical conditioning on CATH fold class labels. The model is built on a scalable transformer architecture tailored to this task.
How It Works
Proteina is a flow-based generative model built on a scalable transformer architecture, with a substantially larger parameter count than prior work. Its core innovation is hierarchical conditioning on CATH fold class labels, which gives control over the generated structures: the labels can steer generation at a coarse level, such as secondary structure content, or request a specific fold. The model also uses several training and sampling techniques, including LoRA fine-tuning, classifier-free guidance, and autoguidance. In addition, the authors introduce new metrics for evaluating the distributional similarity of generated protein sets.
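To make the sampling idea concrete, here is a minimal sketch of classifier-free guidance applied to a flow-based sampler. It is not Proteina's implementation: the function names (guided_velocity, sample, toy_velocity), the "alpha" label, and the toy velocity field are all hypothetical stand-ins for a trained network, and the integration uses plain Euler steps.

```python
import numpy as np

def guided_velocity(v_cond, v_uncond, guidance_weight):
    """Classifier-free guidance: move from the unconditional prediction
    toward (and, for weights above 1, past) the conditional one."""
    return v_uncond + guidance_weight * (v_cond - v_uncond)

def sample(velocity_fn, x0, fold_label, guidance_weight=1.5, num_steps=100):
    """Euler integration of dx/dt = v(x, t) from t=0 (noise) toward t=1 (data)."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v_cond = velocity_fn(x, t, fold_label)  # prediction with the label
        v_uncond = velocity_fn(x, t, None)      # prediction with the label dropped
        x = x + dt * guided_velocity(v_cond, v_uncond, guidance_weight)
    return x

def toy_velocity(x, t, label):
    """Hypothetical stand-in for a trained model: a straight-line flow
    toward a label-dependent target (2.0 for "alpha", 0.0 otherwise)."""
    target = np.full_like(x, 2.0) if label == "alpha" else np.zeros_like(x)
    return (target - x) / (1.0 - t)

# Start from Gaussian noise shaped like 4 residues with 3 coordinates each.
noise = np.random.default_rng(0).normal(size=(4, 3))
backbone = sample(toy_velocity, noise, "alpha", guidance_weight=1.5)
```

With a guidance weight of 1 the sampler follows the conditional field alone; larger weights push samples further toward the label. Autoguidance, as it is usually described, uses the same extrapolation but swaps the unconditional prediction for the output of a weaker version of the model.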
Quick Start & Requirements
Environment setup requires mamba or conda: create an environment from environment.yaml, run conda activate proteina_env, then run pip install -e . (an editable install of the current directory). A .env file containing DATA_PATH=/directory/where/you/store/files is mandatory. Users must also download and unpack proteina_additional_files.zip and proteina_training_data_indices.zip and organize their contents under DATA_PATH as specified in the README. Links to the paper and project page are available in the repository.
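Before running anything, it can help to confirm that the .env file really defines DATA_PATH and that the directory exists. The following is a small sketch using only the standard library. The helper names read_env_file and resolve_data_path are hypothetical and not part of the repository, which may load the file in a different way.

```python
from pathlib import Path

def read_env_file(env_path):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in Path(env_path).read_text().splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values

def resolve_data_path(env_path=".env"):
    """Return DATA_PATH as a Path, failing early if it is missing or not a directory."""
    values = read_env_file(env_path)
    if "DATA_PATH" not in values:
        raise KeyError("DATA_PATH is not set in the .env file")
    data_path = Path(values["DATA_PATH"]).expanduser()
    if not data_path.is_dir():
        raise FileNotFoundError(f"DATA_PATH does not exist: {data_path}")
    return data_path
```

Calling resolve_data_path() from the directory that holds the .env file returns the data directory, or raises an error that names the problem.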
Maintenance & Community
No specific details regarding community channels (e.g., Discord, Slack), active maintainers beyond the author list, or roadmap are provided in the README.
Licensing & Compatibility
The source code, model weights, and auxiliary files are released under an NVIDIA license, strictly for non-commercial or research purposes. Commercial use or linking with closed-source applications is not permitted under this license.
Limitations & Caveats
The use of fold_class_mappings_C_selected_A_T_cath_codes.pth for generating long proteins (300-800 residues) with specific CATH codes is noted as experimental, as many fold types may not naturally occur at these lengths. Conditional generation for specific CATH codes is limited to plausible length-fold combinations observed in training data.
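For readers unfamiliar with CATH codes, the hierarchy behind the conditioning can be sketched as follows. A CATH code is a dotted sequence of numbers naming Class (C), Architecture (A), Topology (T) and Homologous superfamily (H), each level refining the one before it. The helper cath_levels is a hypothetical illustration and does not reflect how the repository stores its label mappings.

```python
def cath_levels(code):
    """Split a CATH code such as "3.40.50" into its cumulative hierarchy levels."""
    parts = code.strip().split(".")
    if not 1 <= len(parts) <= 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"not a valid CATH code: {code!r}")
    names = ("C", "A", "T", "H")
    # Each level keeps the prefix of all coarser levels, e.g. A = "3.40".
    return {names[i]: ".".join(parts[: i + 1]) for i in range(len(parts))}

print(cath_levels("3.40.50"))
# {'C': '3', 'A': '3.40', 'T': '3.40.50'}
```

Conditioning on only the C level constrains generation loosely, while supplying a T-level code requests a specific fold, which is where the length caveats above matter most.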