Mol-Instructions by zjunlp

Biomolecular instruction dataset for LLMs, covering molecule-oriented, protein-oriented, and biomolecular text (NLP) tasks

created 2 years ago
282 stars

Top 93.5% on sourcepulse

View on GitHub
Project Summary

Mol-Instructions is a large-scale dataset and collection of fine-tuned models designed to enhance the capabilities of Large Language Models (LLMs) in biomolecular tasks. It targets researchers and developers in bioinformatics, chemoinformatics, and computational biology, offering a structured way to improve LLM performance on complex molecular and protein-related instructions.

How It Works

The dataset is built through a combination of human-AI collaboration for task description creation, derivation from existing specialized databases, and template-based conversion of structured biological data into textual instructions. This multi-pronged approach ensures a diverse and comprehensive instruction set covering molecule-oriented tasks (description generation, design, reaction prediction), protein-oriented tasks (design, function/activity prediction, domain identification), and biomolecular text tasks (entity recognition, relation extraction, Q&A).
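
To make the template-based conversion concrete, here is a minimal Python sketch; the record fields, templates, and protein-description task are hypothetical illustrations, not the project's actual pipeline:

```python
# Hypothetical sketch of template-based instruction generation.
# Field names ("sequence", "function") and the templates are illustrative only.
import random

TEMPLATES = [
    "Describe the function of the protein with this sequence: {sequence}",
    "What is the biological role of the following protein? {sequence}",
]

def record_to_instruction(record: dict) -> dict:
    """Turn one structured database record into an instruction triple."""
    return {
        "instruction": random.choice(TEMPLATES).format(sequence=record["sequence"]),
        "input": "",
        "output": record["function"],  # the curated annotation becomes the target
    }

example = {"sequence": "MKTAYIAKQR", "function": "Putative DNA-binding regulator."}
print(record_to_instruction(example))
```

Sampling a template per record is a cheap way to add surface diversity, so a model fine-tuned on the data does not overfit to a single phrasing.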

Quick Start & Requirements

  • Installation: Clone the repository (git clone https://github.com/zjunlp/Mol-Instructions).
  • Dependencies: Python, PyTorch, Hugging Face Transformers, Gradio (for demo). Specific model weights are available on Hugging Face.
  • Usage: Fine-tuned models are provided for LLaMA and LLaMA 2. Usage examples and a Gradio demo are available; a minimal loading sketch follows this list.
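
The snippet below is a loading sketch, assuming the released weights are LoRA adapters applied to a LLaMA-family base model via the PEFT library; the adapter id, base-model id, and SELFIES-style prompt are assumptions, so defer to the repository's own usage instructions:

```python
# Minimal sketch of running one of the released fine-tuned models.
# The adapter/base pairing below is an assumption; check the repo for specifics.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel  # pip install peft

base = "meta-llama/Llama-2-7b-hf"             # assumed base model (gated on HF)
adapter = "zjunlp/llama-molinst-molecule-7b"  # assumed molecule-oriented LoRA weights

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(
    base, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter)  # wrap base model with the LoRA adapter

prompt = "Please give me some details about this molecule: [C][C][O]"  # illustrative SELFIES input
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```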

Highlighted Details

  • Over 700K instructions across molecule, protein, and biomolecular text domains (a data-loading sketch follows this list).
  • Fine-tuned models demonstrate significant performance improvements over general LLMs on various biomolecular tasks, including molecular generation and protein understanding.
  • Supports tasks like molecule description generation, retrosynthesis, protein function prediction, and chemical entity recognition.
  • Includes quantitative experiments and benchmarks comparing fine-tuned models against general LLMs and other specialized models.
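
The instruction records themselves can be browsed from the Hugging Face hub; in the sketch below, the config name and the instruction/input/output field layout are assumptions based on common instruction-tuning formats:

```python
# Sketch of inspecting the data with the `datasets` library.
# The config name is an assumption; if it fails, list the dataset's configs on HF.
from datasets import load_dataset

ds = load_dataset("zjunlp/Mol-Instructions", "Molecule-oriented Instructions")
split = list(ds.keys())[0]   # splits appear to be organized per task
print(ds[split][0])          # expect fields like instruction / input / output
```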

Maintenance & Community

The project is maintained by the ZJUNLP group and has had regular updates, including results on LLaMA 3 and the release of related projects such as ChatCell. The primary paper was accepted at ICLR 2024.

Licensing & Compatibility

The dataset is licensed under CC BY 4.0. All data and model weights are exclusively licensed for research purposes. Usage that may lead to harm or detriment to society is strictly forbidden.

Limitations & Caveats

The current models are preliminary demonstrations and have limited capacity for real-world, production-grade tasks. A significant amount of instruction data remains to be collected.

Health Check

  • Last commit: 9 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 9 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering, Designing Machine Learning Systems) and Georgios Konstantopoulos (CTO, General Partner at Paradigm).

LongLoRA by dvlab-research: efficient fine-tuning for long-context LLMs. 3k stars, top 0.1% on sourcepulse; created 1 year ago, updated 11 months ago. Starred by Andrej Karpathy (founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), John Yang (author of SWE-bench, SWE-agent), and 13 more.

stanford_alpaca by tatsu-lab: instruction-following LLaMA model training and data generation. 30k stars, top 0.1% on sourcepulse; created 2 years ago, updated 1 year ago.