Biomolecular instruction dataset for LLMs, targeting molecule/protein properties and NLP tasks
Mol-Instructions is a large-scale dataset and collection of fine-tuned models designed to enhance the capabilities of Large Language Models (LLMs) in biomolecular tasks. It targets researchers and developers in bioinformatics, chemoinformatics, and computational biology, offering a structured way to improve LLM performance on complex molecular and protein-related instructions.
How It Works
The dataset is built through a combination of human-AI collaboration for task description creation, derivation from existing specialized databases, and template-based conversion of structured biological data into textual instructions. This multi-pronged approach ensures a diverse and comprehensive instruction set covering molecule-oriented tasks (description generation, design, reaction prediction), protein-oriented tasks (design, function/activity prediction, domain identification), and biomolecular text tasks (entity recognition, relation extraction, Q&A).
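The template-based conversion step can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the template wording, the record fields (`name`, `sequence`, `function`), the helper name `record_to_instruction`, and the `instruction`/`input`/`output` sample layout are hypothetical and are not taken from the project's own pipeline.

```python
import random

# Hypothetical instruction templates; the wording actually used by the
# dataset authors is not given in this summary.
TEMPLATES = [
    "What is the function of the following protein sequence?",
    "Describe the likely function of this protein.",
]


def record_to_instruction(record, rng):
    """Convert one structured annotation record into an instruction-style sample."""
    return {
        # Pick a template at random so the phrasing varies across samples.
        "instruction": rng.choice(TEMPLATES),
        # The raw biological data becomes the model input.
        "input": record["sequence"],
        # The structured annotation is rendered as a natural-language answer.
        "output": (
            f"The protein {record['name']} is annotated with "
            f"the function: {record['function']}."
        ),
    }


# A made-up record standing in for an entry from a specialized database.
record = {"name": "ExampleProtein", "sequence": "MKTAYIAKQR", "function": "ATP binding"}
sample = record_to_instruction(record, random.Random(0))
print(sample)
```

Drawing the template at random is one simple way to get varied phrasing from the same structured record; a seeded `random.Random` keeps the conversion reproducible.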
Quick Start & Requirements
git clone https://github.com/zjunlp/Mol-Instruction
Highlighted Details
Maintenance & Community
The project is associated with Zjunlp and has received regular updates, including results on LLaMA3 and the release of related projects such as ChatCell. The primary paper was accepted at ICLR 2024.
Licensing & Compatibility
The dataset is licensed under CC BY 4.0. All data and model weights are exclusively licensed for research purposes. Usage that may lead to harm or detriment to society is strictly forbidden.
Limitations & Caveats
The current models are preliminary demonstrations and have limited capacity for real-world, production-grade tasks. A significant amount of instruction data remains to be collected.