BERT-like models for chemical SMILES data, drug design, and chemical modeling
ChemBERTa provides Hugging Face-compatible transformer models for cheminformatics, enabling drug design and molecular property prediction directly from SMILES strings. It targets developers, students, and researchers in computational chemistry and machine learning.
How It Works
The project is built on RoBERTa, a BERT variant, pretrained with a masked language modeling (MLM) objective on large chemical datasets such as ZINC and PubChem. By learning to recover masked tokens in SMILES strings, the models acquire rich representations of molecular structure that transfer to downstream tasks; a minimal sketch of the objective follows.
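The sketch below runs one MLM step with the transformers library. The checkpoint name seyonec/ChemBERTa-zinc-base-v1 is an assumption (this summary names no specific checkpoint), and the masking is delegated to transformers' standard DataCollatorForLanguageModeling rather than the project's own training code.

from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
)

checkpoint = "seyonec/ChemBERTa-zinc-base-v1"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# SMILES strings are treated as ordinary text and split into sub-tokens.
smiles = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"]  # ethanol, benzene, aspirin
examples = [tokenizer(s) for s in smiles]

# The collator randomly replaces ~15% of tokens with <mask> and sets labels
# so the model is trained to reconstruct the original tokens.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
batch = collator(examples)

outputs = model(**batch)
print(outputs.loss)  # cross-entropy over the masked positions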
Quick Start & Requirements
Requires the Hugging Face transformers library and PyTorch. Pretrained models and tokenizers are loaded with AutoModelWithLMHead.from_pretrained and AutoTokenizer.from_pretrained. Example: pipeline('fill-mask', model=model, tokenizer=tokenizer), as in the sketch below.
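A minimal, runnable version of that snippet is sketched here. The checkpoint name seyonec/ChemBERTa-zinc-base-v1 is an assumption (none is given above), and AutoModelForMaskedLM is used as the non-deprecated equivalent of AutoModelWithLMHead in current transformers releases.

from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

checkpoint = "seyonec/ChemBERTa-zinc-base-v1"  # assumed checkpoint name
model = AutoModelForMaskedLM.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)

# Mask one token after a benzene ring and ask for plausible completions.
for pred in fill_mask(f"c1ccccc1{tokenizer.mask_token}"):
    print(pred["token_str"], round(pred["score"], 3))

Each prediction pairs a candidate sub-token with its score; highly ranked, chemically plausible completions are the usual informal sanity check for MLM-pretrained chemistry models.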
Highlighted Details
Maintenance & Community
The project is associated with Seyone Chithrananda and has plans for integration with DeepChem. Further updates and releases are anticipated following formal publications.
Licensing & Compatibility
Limitations & Caveats
The repository is primarily a collection of notebooks; full model implementation and attention visualization code are pending formal publication. Support for a wider array of property prediction tasks is in progress.