Mol-Instructions by zjunlp

Biomolecular instruction dataset for LLMs, covering molecule-oriented, protein-oriented, and biomolecular text (NLP) tasks

created 2 years ago
282 stars

Top 93.5% on sourcepulse

View on GitHub
Project Summary

Mol-Instructions is a large-scale dataset and collection of fine-tuned models designed to enhance the capabilities of Large Language Models (LLMs) in biomolecular tasks. It targets researchers and developers in bioinformatics, chemoinformatics, and computational biology, offering a structured way to improve LLM performance on complex molecular and protein-related instructions.

How It Works

The dataset is built through a combination of human-AI collaboration for task description creation, derivation from existing specialized databases, and template-based conversion of structured biological data into textual instructions. This multi-pronged approach ensures a diverse and comprehensive instruction set covering molecule-oriented tasks (description generation, design, reaction prediction), protein-oriented tasks (design, function/activity prediction, domain identification), and biomolecular text tasks (entity recognition, relation extraction, Q&A).
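
To make the template-based conversion concrete, here is a minimal Python sketch; the record fields, templates, and protein-description task are hypothetical illustrations, not the project's actual pipeline:

```python
# Hypothetical sketch of template-based instruction generation.
# Field names ("sequence", "function") and the templates are illustrative only.
import random

TEMPLATES = [
    "Describe the function of the protein with this sequence: {sequence}",
    "What is the biological role of the following protein? {sequence}",
]

def record_to_instruction(record: dict) -> dict:
    """Turn one structured database record into an instruction triple."""
    return {
        "instruction": random.choice(TEMPLATES).format(sequence=record["sequence"]),
        "input": "",
        "output": record["function"],  # the curated annotation becomes the target
    }

example = {"sequence": "MKTAYIAKQR", "function": "Putative DNA-binding regulator."}
print(record_to_instruction(example))
```

Sampling a template per record is a cheap way to add surface diversity, so a model fine-tuned on the data does not overfit to a single phrasing.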

Quick Start & Requirements

  • Installation: Clone the repository (git clone https://github.com/zjunlp/Mol-Instructions).
  • Dependencies: Python, PyTorch, Hugging Face Transformers, Gradio (for demo). Specific model weights are available on Hugging Face.
  • Usage: Fine-tuned models are provided for LLaMA and LLaMA 2. Usage examples and a Gradio demo are available; a minimal loading sketch follows this list.
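
The snippet below is a loading sketch, assuming the released weights are LoRA adapters applied to a LLaMA-family base model via the PEFT library; the adapter id, base-model id, and SELFIES-style prompt are assumptions, so defer to the repository's own usage instructions:

```python
# Minimal sketch of running one of the released fine-tuned models.
# The adapter/base pairing below is an assumption; check the repo for specifics.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel  # pip install peft

base = "meta-llama/Llama-2-7b-hf"             # assumed base model (gated on HF)
adapter = "zjunlp/llama-molinst-molecule-7b"  # assumed molecule-oriented LoRA weights

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(
    base, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter)  # wrap base model with the LoRA adapter

prompt = "Please give me some details about this molecule: [C][C][O]"  # illustrative SELFIES input
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```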

Highlighted Details

  • Over 700K instructions across molecule, protein, and biomolecular text domains (a data-loading sketch follows this list).
  • Fine-tuned models demonstrate significant performance improvements over general LLMs on various biomolecular tasks, including molecular generation and protein understanding.
  • Supports tasks like molecule description generation, retrosynthesis, protein function prediction, and chemical entity recognition.
  • Includes quantitative experiments and benchmarks comparing fine-tuned models against general LLMs and other specialized models.
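
The instruction records themselves can be browsed from the Hugging Face hub; in the sketch below, the config name and the instruction/input/output field layout are assumptions based on common instruction-tuning formats:

```python
# Sketch of inspecting the data with the `datasets` library.
# The config name is an assumption; if it fails, list the dataset's configs on HF.
from datasets import load_dataset

ds = load_dataset("zjunlp/Mol-Instructions", "Molecule-oriented Instructions")
split = list(ds.keys())[0]   # splits appear to be organized per task
print(ds[split][0])          # expect fields like instruction / input / output
```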

Maintenance & Community

The project is maintained by the ZJUNLP group and has had regular updates, including results on LLaMA 3 and the release of related projects such as ChatCell. The primary paper was accepted at ICLR 2024.

Licensing & Compatibility

The dataset is licensed under CC BY 4.0. All data and model weights are exclusively licensed for research purposes. Usage that may lead to harm or detriment to society is strictly forbidden.

Limitations & Caveats

The current models are preliminary demonstrations and have limited capacity for real-world, production-grade tasks. A significant amount of instruction data remains to be collected.

Health Check

  • Last commit: 9 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 9 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering, Designing Machine Learning Systems) and Georgios Konstantopoulos (CTO, General Partner at Paradigm).

LongLoRA by dvlab-research: efficient fine-tuning for long-context LLMs. 3k stars, top 0.1% on sourcepulse; created 1 year ago, updated 11 months ago. Starred by Andrej Karpathy (founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), John Yang (author of SWE-bench, SWE-agent), and 13 more.

stanford_alpaca by tatsu-lab: instruction-following LLaMA model training and data generation. 30k stars, top 0.1% on sourcepulse; created 2 years ago, updated 1 year ago.