raphaelmourad / LLM training for genomic sequence analysis
This repository provides a comprehensive tutorial on applying Large Language Models (LLMs) to genomics tasks. Aimed at researchers and practitioners interested in LLMs for biological sequence analysis, it offers practical guidance through lectures and hands-on lab sessions, demystifying LLM use in a specialized domain and enabling tasks such as sequence classification, mutation-effect prediction, and synthetic data generation.
How It Works
The tutorial covers several key LLM workflows applied to DNA sequences. It begins by pretraining a simplified Mistral model from scratch on human genome sequences using causal language modeling. It then demonstrates finetuning pretrained LLMs from Hugging Face for DNA sequence classification tasks, such as identifying protein-binding sites or promoter activity. Finally, it covers zero-shot prediction of mutation effects from embedding distances, and the use of LLMs for synthetic DNA sequence generation and optimization.
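As a concrete illustration of the pretraining step, here is a minimal sketch of causal-LM pretraining on the repository's 200 bp human genome sequences. It assumes the Hugging Face transformers and datasets libraries; the tokenizer reuse, the CSV column name (`sequence`), and the model sizes are illustrative guesses rather than the tutorial's exact setup.

```python
# A minimal sketch of from-scratch pretraining, assuming the Hugging Face
# transformers/datasets APIs. The tokenizer reuse, the "sequence" column
# name, and the model sizes are assumptions, not the tutorial's exact setup.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    MistralConfig,
    MistralForCausalLM,
    Trainer,
    TrainingArguments,
)

# Reuse the DNA tokenizer published with the pretrained checkpoint.
tokenizer = AutoTokenizer.from_pretrained("RaphaelMourad/Mistral-DNA-v1-17M-hg38")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # needed for batch padding

# A deliberately small Mistral configuration for demonstration-scale training.
config = MistralConfig(
    vocab_size=len(tokenizer),
    hidden_size=256,
    intermediate_size=512,
    num_hidden_layers=4,
    num_attention_heads=4,
    num_key_value_heads=4,
    max_position_embeddings=512,
)
model = MistralForCausalLM(config)

# 200 bp human genome sequences shipped with the repository.
dataset = load_dataset("csv", data_files="sequences_hg38_200b_verysmall.csv.gz")["train"]

def tokenize(batch):
    return tokenizer(batch["sequence"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="mistral-dna-scratch",
        per_device_train_batch_size=8,
        num_train_epochs=1,
        logging_steps=100,
    ),
    train_dataset=tokenized,
    # mlm=False selects causal (next-token) language modeling.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```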
Quick Start & Requirements
The tutorial's labs are run through the provided Google Colab scripts, which give access to the necessary compute and pretrained models. Familiarity with Python and basic LLM concepts is expected. The required datasets, such as sequences_hg38_200b_verysmall.csv.gz, are included in the repository, and links to Hugging Face models such as RaphaelMourad/Mistral-DNA-v1-17M-hg38 and RaphaelMourad/Mistral-DNA-v1-138M-yeast are provided for direct access.
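To illustrate the zero-shot mutation-effect idea with the linked checkpoint, here is a minimal sketch that scores a variant by the embedding distance between reference and mutated sequences. It assumes the checkpoint loads with the standard transformers Mistral classes and uses mean-pooled last hidden states as the sequence embedding; both choices are assumptions rather than the tutorial's exact method.

```python
# A minimal zero-shot sketch: score a single-nucleotide variant by the cosine
# distance between embeddings of the reference and mutated sequences. The
# mean-pooling over tokens and the distance metric are assumptions; the
# tutorial's notebooks may compute embeddings differently.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "RaphaelMourad/Mistral-DNA-v1-17M-hg38"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).eval()

def embed(seq: str) -> torch.Tensor:
    """Mean-pool the last hidden states into one vector per sequence."""
    inputs = tokenizer(seq, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, n_tokens, hidden_dim)
    return hidden.mean(dim=1).squeeze(0)

ref = "ACGTTGCATGTCGCATGATGCATGAGAGCT"
alt = ref[:10] + "A" + ref[11:]  # introduce a single-nucleotide substitution

# A larger distance suggests the model represents the mutated sequence as
# further from the reference, a proxy for a stronger mutation effect.
distance = 1 - torch.nn.functional.cosine_similarity(embed(ref), embed(alt), dim=0)
print(f"embedding distance (mutation-effect proxy): {distance.item():.4f}")
```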
Maintenance & Community
The README does not document community channels (such as Discord or Slack) or active contributors; the SourcePulse listing records the last update as 5 months ago and marks the project Inactive.
Licensing & Compatibility
The repository's README does not specify a software license. Users should verify licensing terms for any underlying models or datasets used, particularly for commercial applications.
Limitations & Caveats
The pretraining dataset is explicitly labeled "verysmall," so it is unlikely to be sufficient for training robust, general-purpose genomic LLMs from scratch. The labs also rely heavily on Google Colab, which may limit users who need extensive local control or offline processing.