raphaelmourad / LLM training for genomic sequence analysis
This repository provides a comprehensive tutorial on applying Large Language Models (LLMs) to genomics tasks. Aimed at researchers and practitioners interested in LLMs for biological sequence analysis, it offers practical guidance through lectures and hands-on lab sessions, demystifying LLM use in a specialized domain and enabling tasks such as sequence classification, mutation-effect prediction, and synthetic data generation.
How It Works
The tutorial covers several key LLM workflows applied to DNA sequences. It begins by pretraining a simplified Mistral model from scratch on human genome sequences using causal language modeling. It then demonstrates finetuning pretrained LLMs from Hugging Face for DNA sequence classification tasks, such as identifying protein-binding sites or promoter activity. Finally, it covers zero-shot prediction of mutation effects from embedding distances, and the use of LLMs for synthetic DNA sequence generation and optimization.
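As a concrete illustration of the pretraining step, here is a minimal sketch of causal-LM pretraining on the repository's 200 bp human genome sequences. It assumes the Hugging Face transformers and datasets libraries; the tokenizer reuse, the CSV column name (`sequence`), and the model sizes are illustrative guesses rather than the tutorial's exact setup.

```python
# A minimal sketch of from-scratch pretraining, assuming the Hugging Face
# transformers/datasets APIs. The tokenizer reuse, the "sequence" column
# name, and the model sizes are assumptions, not the tutorial's exact setup.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    MistralConfig,
    MistralForCausalLM,
    Trainer,
    TrainingArguments,
)

# Reuse the DNA tokenizer published with the pretrained checkpoint.
tokenizer = AutoTokenizer.from_pretrained("RaphaelMourad/Mistral-DNA-v1-17M-hg38")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # needed for batch padding

# A deliberately small Mistral configuration for demonstration-scale training.
config = MistralConfig(
    vocab_size=len(tokenizer),
    hidden_size=256,
    intermediate_size=512,
    num_hidden_layers=4,
    num_attention_heads=4,
    num_key_value_heads=4,
    max_position_embeddings=512,
)
model = MistralForCausalLM(config)

# 200 bp human genome sequences shipped with the repository.
dataset = load_dataset("csv", data_files="sequences_hg38_200b_verysmall.csv.gz")["train"]

def tokenize(batch):
    return tokenizer(batch["sequence"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="mistral-dna-scratch",
        per_device_train_batch_size=8,
        num_train_epochs=1,
        logging_steps=100,
    ),
    train_dataset=tokenized,
    # mlm=False selects causal (next-token) language modeling.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```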
Quick Start & Requirements
The tutorial's labs are run through the provided Google Colab scripts, which give access to the necessary compute and pretrained models. Familiarity with Python and basic LLM concepts is expected. The required datasets, such as sequences_hg38_200b_verysmall.csv.gz, are included in the repository, and links to Hugging Face models such as RaphaelMourad/Mistral-DNA-v1-17M-hg38 and RaphaelMourad/Mistral-DNA-v1-138M-yeast are provided for direct access.
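To illustrate the zero-shot mutation-effect idea with the linked checkpoint, here is a minimal sketch that scores a variant by the embedding distance between reference and mutated sequences. It assumes the checkpoint loads with the standard transformers Mistral classes and uses mean-pooled last hidden states as the sequence embedding; both choices are assumptions rather than the tutorial's exact method.

```python
# A minimal zero-shot sketch: score a single-nucleotide variant by the cosine
# distance between embeddings of the reference and mutated sequences. The
# mean-pooling over tokens and the distance metric are assumptions; the
# tutorial's notebooks may compute embeddings differently.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "RaphaelMourad/Mistral-DNA-v1-17M-hg38"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).eval()

def embed(seq: str) -> torch.Tensor:
    """Mean-pool the last hidden states into one vector per sequence."""
    inputs = tokenizer(seq, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, n_tokens, hidden_dim)
    return hidden.mean(dim=1).squeeze(0)

ref = "ACGTTGCATGTCGCATGATGCATGAGAGCT"
alt = ref[:10] + "A" + ref[11:]  # introduce a single-nucleotide substitution

# A larger distance suggests the model represents the mutated sequence as
# further from the reference, a proxy for a stronger mutation effect.
distance = 1 - torch.nn.functional.cosine_similarity(embed(ref), embed(alt), dim=0)
print(f"embedding distance (mutation-effect proxy): {distance.item():.4f}")
```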
Maintenance & Community
The README does not document community channels (such as Discord or Slack) or active contributors; the SourcePulse listing records the last update as 5 months ago and marks the project Inactive.
Licensing & Compatibility
The repository's README does not specify a software license. Users should verify licensing terms for any underlying models or datasets used, particularly for commercial applications.
Limitations & Caveats
The pretraining dataset is explicitly labeled "verysmall," so it is unlikely to be sufficient for training robust, general-purpose genomic LLMs from scratch. The labs also rely heavily on Google Colab, which may limit users who need extensive local control or offline processing.