bert-loves-chemistry by seyonechithrananda

BERT-like models for chemical SMILES data, drug design, and chemical modeling

created 5 years ago
465 stars

Top 66.2% on sourcepulse

Project Summary

ChemBERTa provides HuggingFace-compatible models for chemical informatics tasks, enabling drug design and property prediction using transformer architectures on SMILES strings. It targets developers, students, and researchers in computational chemistry and machine learning.

How It Works

The project leverages RoBERTa, a BERT variant, trained with a masked language modeling (MLM) objective on large chemical datasets such as ZINC and PubChem. This lets the models learn rich representations of molecular structures encoded as SMILES strings, which can then be fine-tuned for downstream tasks.
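A minimal sketch of what this looks like in practice: the tokenizer splits a SMILES string into sub-word pieces, and the resulting token ids are what the transformer consumes during MLM pre-training. The checkpoint id below (seyonec/ChemBERTa-zinc-base-v1) is an assumption for illustration; check the HuggingFace Hub for the exact name.

```python
# Hedged sketch: how a SMILES string is tokenized before masked-language-model
# pre-training. The checkpoint id below is an assumption, not taken from the repo.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("seyonec/ChemBERTa-zinc-base-v1")  # assumed id

smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin
print(tokenizer.tokenize(smiles))      # sub-word pieces the model sees
print(tokenizer(smiles)["input_ids"])  # integer ids fed to the transformer
```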

Quick Start & Requirements

  • Install: pip install transformers (the models are distributed as standard HuggingFace checkpoints).
  • Prerequisites: Python, HuggingFace transformers, PyTorch.
  • Usage: Load models and tokenizers via AutoModelWithLMHead.from_pretrained and AutoTokenizer.from_pretrained (AutoModelForMaskedLM in newer transformers releases), then run pipeline('fill-mask', model=model, tokenizer=tokenizer); see the sketch after this list.
  • Resources: Pre-trained weights are available on HuggingFace.
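Putting the pieces together, here is a minimal fill-mask sketch under the same assumption about the checkpoint id (seyonec/ChemBERTa-zinc-base-v1 is illustrative; use whichever ChemBERTa weights you find on the Hub). It uses AutoModelForMaskedLM, the non-deprecated equivalent of AutoModelWithLMHead.

```python
# Minimal sketch of the fill-mask usage described above; the model id is an assumption.
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model_name = "seyonec/ChemBERTa-zinc-base-v1"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# Mask one position in benzene's SMILES and let the model propose completions.
masked_smiles = f"C1=CC=CC{tokenizer.mask_token}C1"
for prediction in fill_mask(masked_smiles):
    print(prediction["token_str"], round(prediction["score"], 4))
```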

Highlighted Details

  • Models pre-trained on ZINC subsets (100k, 250k) and PubChem subsets (100k, 250k, 1M, 10M).
  • Includes notebooks for pre-training and fine-tuning setups; a fine-tuning sketch follows this list.
  • Planned release of an attention visualization suite and DeepChem integration.
  • Ongoing work to release larger models and support more property prediction tasks.
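For the fine-tuning side, a compact sketch of adapting ChemBERTa to a molecular property prediction task; the checkpoint id, toy SMILES, labels, and hyperparameters are all illustrative assumptions rather than the repo's notebook code.

```python
# Hedged sketch: fine-tuning ChemBERTa for a binary property prediction task.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "seyonec/ChemBERTa-zinc-base-v1"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Toy batch: SMILES strings with made-up 0/1 property labels.
smiles = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"]
labels = torch.tensor([0, 1, 1])
batch = tokenizer(smiles, padding=True, truncation=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):  # a few gradient steps just to show the training loop shape
    outputs = model(**batch, labels=labels)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(float(outputs.loss))
```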

Maintenance & Community

The project is maintained by Seyone Chithrananda, with planned integration into DeepChem. Further updates and releases are expected to follow formal publication.

Licensing & Compatibility

  • License: MIT License.
  • Compatibility: Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The repository is primarily a collection of notebooks; full model implementation and attention visualization code are pending formal publication. Support for a wider array of property prediction tasks is in progress.

Health Check

  • Last commit: 9 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 21 stars in the last 90 days

Explore Similar Projects

Starred by Jeremy Howard (cofounder of fast.ai) and Stas Bekman (author of the Machine Learning Engineering Open Book; research engineer at Snowflake).

  • SwissArmyTransformer by THUDM: transformer library for flexible model development. Top 0.3% on sourcepulse, 1k stars, created 3 years ago, updated 7 months ago.