IBM MoLFormer: Chemical language model for property prediction and feature extraction
Large-scale chemical language representations capture molecular structure and properties, addressing the challenge of limited labeled data in drug discovery and material design. This project provides PyTorch code and data for MoLFormer, a Transformer-based model trained on over a billion molecules represented as SMILES strings. It enables researchers and engineers to leverage powerful, pre-trained molecular embeddings for downstream property prediction tasks, outperforming existing baselines.
How It Works
MoLFormer utilizes a Transformer architecture adapted for chemical sequences (SMILES strings), employing Masked Language Modeling (MLM) for self-supervised pre-training on massive datasets such as PubChem and ZINC. Key innovations include a linear attention mechanism for efficiency and rotary positional embeddings for encoding relative token positions within SMILES sequences. This approach allows the model to learn compact, meaningful molecular representations that generalize well to a range of downstream prediction tasks after fine-tuning.
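The repository itself ships training and fine-tuning code rather than a packaged inference API. As a rough illustration of what consuming the learned embeddings looks like, here is a minimal sketch assuming a MoLFormer checkpoint hosted on Hugging Face; the model identifier and the use of trust_remote_code are assumptions to verify against the official release, not something documented in this repo.

```python
# Minimal sketch: pulling frozen MoLFormer-style embeddings for a few SMILES strings.
# ASSUMPTION: a checkpoint published under a name like "ibm/MoLFormer-XL-both-10pct";
# confirm the exact identifier and whether trust_remote_code is acceptable for you.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "ibm/MoLFormer-XL-both-10pct"  # assumed identifier
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
model.eval()

smiles = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"]  # ethanol, benzene, aspirin
inputs = tokenizer(smiles, padding=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the token embeddings over real (non-padding) positions to get one
# vector per molecule, suitable as input to a downstream property predictor.
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # (3, hidden_dim)
```

In practice, these pooled vectors are what the fine-tuning scripts build property-prediction heads on top of.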
Quick Start & Requirements
Requirements include building NVIDIA's apex library from source for its fused optimizer; detailed environment setup is provided in environment.md. Getting started involves downloading Pretrained MoLFormer.zip and finetune_datasets.zip from https://ibm.box.com/v/MoLFormer-data and extracting them into a data/ directory with the specific sub-directory structure expected by the pre-training and fine-tuning scripts.
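A minimal sketch of unpacking the downloaded archives into a data/ directory is shown below. The archive names come from the Box link above; the exact sub-directory layout is defined by the repo's pre-training and fine-tuning configs, so compare the extracted paths against the structure documented there rather than relying on this sketch.

```python
# Extract the two downloaded archives into data/ and list what landed there,
# so the result can be checked against the directory structure the repo expects.
import zipfile
from pathlib import Path

data_dir = Path("data")
data_dir.mkdir(exist_ok=True)

for archive in ["Pretrained MoLFormer.zip", "finetune_datasets.zip"]:
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(data_dir)

# Sanity check: print the first few extracted paths for manual comparison.
for path in sorted(data_dir.rglob("*"))[:20]:
    print(path)
```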
Maintenance & Community
The repository is associated with a Nature Machine Intelligence publication, indicating a research-driven origin. No specific community channels (e.g., Discord, Slack) or active maintenance indicators are detailed in the README.
Licensing & Compatibility
The provided README does not specify a software license. This lack of explicit licensing information may pose compatibility concerns for commercial use or integration into closed-source projects.
Limitations & Caveats
The distributed checkpoints are not the full MoLFormer-XL versions. To analyze full attention mechanisms, users must train a new model, as the provided pre-trained model uses linear attention. Data preprocessing requires specific canonicalization via RDKit and adherence to strict directory structures. The dependency on a source-compiled Apex library can complicate setup.
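Since the preprocessing step hinges on RDKit canonicalization, the following sketch shows that baseline operation; the repo may layer additional normalization (for example, filtering by SMILES length) on top, so treat this as the first step only.

```python
# Baseline SMILES canonicalization with RDKit. Invalid strings parse to None,
# which RDKit reports with a warning on stderr.
from typing import Optional

from rdkit import Chem

def canonicalize(smiles: str) -> Optional[str]:
    """Return RDKit's canonical SMILES, or None if the string cannot be parsed."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol, canonical=True) if mol is not None else None

print(canonicalize("C1=CC=CC=C1"))   # benzene -> "c1ccccc1"
print(canonicalize("not a smiles"))  # -> None
```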