molformer by IBM

Chemical language model for property prediction and feature extraction

Created 3 years ago
372 stars

Top 76.1% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

Large-scale chemical language representations capture molecular structure and properties, addressing the challenge of limited labeled data in drug discovery and material design. This project provides PyTorch code and data for MoLFormer, a Transformer-based model trained on over a billion molecules represented as SMILES strings. It enables researchers and engineers to leverage powerful, pre-trained molecular embeddings for downstream property prediction tasks, outperforming existing baselines.

How It Works

MoLFormer uses a Transformer architecture adapted for chemical sequences (SMILES strings), pre-trained with self-supervised masked language modeling (MLM) on massive datasets such as PubChem and ZINC. Key innovations are a linear attention mechanism for efficiency and rotary positional embeddings for encoding relative token positions; attention analysis suggests the resulting representations also capture interatomic spatial relationships. After fine-tuning, the learned molecular representations generalize well to a variety of downstream prediction tasks.
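The two attention-related ideas can be illustrated with a minimal NumPy sketch. This is an illustration only, not the repository's implementation: the real model applies these per attention head inside a full Transformer, with the generalized feature maps described in the paper.

```python
import numpy as np

def elu_feature_map(x):
    # Positive feature map phi(x) = ELU(x) + 1, the standard choice
    # for softmax-free linear attention.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(q, k, v):
    """Softmax-free attention in O(n * d^2) instead of O(n^2 * d).

    q, k, v: (n, d) arrays for a single head.
    """
    fq, fk = elu_feature_map(q), elu_feature_map(k)
    kv = fk.T @ v                    # (d, d) key-value summary
    z = fq @ fk.sum(axis=0)          # (n,) normalizer per query
    return (fq @ kv) / z[:, None]

def rotary_embed(x, base=10000.0):
    """Rotate consecutive feature pairs of x (shape (n, d), d even) by
    position-dependent angles, so query-key dot products depend on
    relative position."""
    n, d = x.shape
    inv_freq = base ** (-np.arange(0, d, 2) / d)   # (d/2,)
    angles = np.outer(np.arange(n), inv_freq)      # (n, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

In the model, the rotation is applied to queries and keys before the attention product; the sketch omits multi-head batching.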

Quick Start & Requirements

  • Installation: Requires compiling NVIDIA's Apex library from source, since the training code depends on an Apex-provided optimizer. Detailed environment setup is provided in environment.md.
  • Prerequisites: Python, PyTorch, RDKit (for data canonicalization), and Apex. Tested on Nvidia V100 GPUs.
  • Data: Requires downloading Pretrained MoLFormer.zip and finetune_datasets.zip from https://ibm.box.com/v/MoLFormer-data and extracting them into a data/ directory with specific sub-directory structures for pre-training and fine-tuning datasets.

Highlighted Details

  • Provides checkpoints for a model pre-trained on ~100 million molecules, demonstrating competitive performance on MoleculeNet benchmarks.
  • Attention analysis confirms the model learns spatial relationships between atoms within molecules.
  • Includes notebooks for using pre-trained models as feature extractors and for visualizing attention patterns.
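When an encoder is used as a frozen feature extractor, its per-token outputs are typically reduced to one fixed-size vector per molecule by masked mean pooling. The sketch below shows only that pooling step (a common convention, not necessarily the exact procedure in the repository's notebooks, which also handle model loading and tokenization):

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token embeddings over real (non-padding) tokens.

    token_embeddings: (n_tokens, d) encoder outputs for one SMILES string.
    attention_mask:   (n_tokens,) of 0/1, with 0 marking padding.
    Returns a single (d,) molecular feature vector.
    """
    mask = attention_mask[:, None].astype(float)
    return (token_embeddings * mask).sum(axis=0) / mask.sum()
```

The resulting vectors can be fed directly to a lightweight downstream model (e.g. a linear probe) for property prediction.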

Maintenance & Community

The repository is associated with a Nature Machine Intelligence publication, indicating a research-driven origin. No specific community channels (e.g., Discord, Slack) or active maintenance indicators are detailed in the README.

Licensing & Compatibility

The provided README does not specify a software license. This lack of explicit licensing information may pose compatibility concerns for commercial use or integration into closed-source projects.

Limitations & Caveats

The distributed checkpoints are not the full MoLFormer-XL versions. To analyze full attention mechanisms, users must train a new model, as the provided pre-trained model uses linear attention. Data preprocessing requires specific canonicalization via RDKit and adherence to strict directory structures. The dependency on a source-compiled Apex library can complicate setup.

Health Check

  • Last Commit: 3 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 6 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Wei-Lin Chiang (Cofounder of LMArena), and 13 more.

awesome-tensor-compilers by merrymercy

Top 0.1% on SourcePulse
3k stars
Curated list of tensor compiler projects and papers
Created 5 years ago
Updated 1 year ago
Starred by Shengjia Zhao (Chief Scientist at Meta Superintelligence Lab), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 14 more.

BIG-bench by google

Top 0.1% on SourcePulse
3k stars
Collaborative benchmark for probing and extrapolating LLM capabilities
Created 5 years ago
Updated 1 year ago
Starred by Lysandre Debut (Chief Open-Source Officer at Hugging Face), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 14 more.

simpletransformers by ThilinaRajapakse

Top 0% on SourcePulse
4k stars
Rapid NLP task implementation
Created 6 years ago
Updated 4 months ago
Starred by Aravind Srinivas (Cofounder of Perplexity), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 16 more.

text-to-text-transfer-transformer by google-research

Top 0.1% on SourcePulse
6k stars
Unified text-to-text transformer for NLP research
Created 6 years ago
Updated 2 days ago
Starred by Vaibhav Nivargi (Cofounder of Moveworks), Chuan Li (Chief Scientific Officer at Lambda), and 5 more.

awesome-mlops by visenger

Top 0.1% on SourcePulse
14k stars
Curated MLOps knowledge hub
Created 5 years ago
Updated 1 year ago