LLM-for-genomics-training by raphaelmourad

LLM training for genomic sequence analysis

Created 9 months ago
257 stars

Top 98.4% on SourcePulse

Project Summary

This repository provides a comprehensive tutorial on applying Large Language Models (LLMs) to genomics tasks. It targets researchers and practitioners interested in using LLMs for biological sequence analysis, offering practical guidance through lectures and hands-on lab sessions. It demystifies LLM application in a specialized domain, enabling users to perform tasks such as sequence classification, mutation effect prediction, and synthetic data generation.

How It Works

The tutorial covers several key LLM workflows applied to DNA sequences. It begins by pretraining a simplified Mistral model from scratch on human genome sequences using causal language modeling. It then demonstrates finetuning pre-existing LLMs from Hugging Face for specific DNA sequence classification tasks, such as identifying protein binding sites or promoter activity. Finally, it covers zero-shot prediction of mutation effects by comparing the embedding distances of reference and mutated sequences, and the use of LLMs for synthetic DNA sequence generation and optimization.
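As a concrete illustration of the pretraining step, the following is a minimal sketch of causal language modeling on DNA sequences with Hugging Face Transformers. It is not the tutorial's exact code: the configuration sizes, the file path, the CSV column name "sequence", and the reuse of the published checkpoint's tokenizer are all assumptions made for brevity.

# Minimal pretraining sketch (assumptions: config sizes, data path, CSV column
# name "sequence", and reuse of the published checkpoint's tokenizer).
from transformers import (AutoTokenizer, MistralConfig, MistralForCausalLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("RaphaelMourad/Mistral-DNA-v1-17M-hg38")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # padding token needed by the collator

# Small, simplified Mistral configuration (illustrative sizes, not the tutorial's).
config = MistralConfig(
    vocab_size=len(tokenizer),
    hidden_size=256,
    intermediate_size=512,
    num_hidden_layers=8,
    num_attention_heads=8,
    num_key_value_heads=4,
)
model = MistralForCausalLM(config)

# 200-bp hg38 sequences shipped with the repository (path and column name assumed).
dataset = load_dataset("csv", data_files="data/sequences_hg38_200b_verysmall.csv.gz")

def tokenize(batch):
    return tokenizer(batch["sequence"], truncation=True, max_length=256)

train_set = dataset["train"].map(tokenize, batched=True,
                                 remove_columns=dataset["train"].column_names)

# mlm=False gives causal (next-token) language-modeling labels.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mistral-dna-pretrain",
                           per_device_train_batch_size=16,
                           num_train_epochs=1,
                           logging_steps=100),
    train_dataset=train_set,
    data_collator=collator,
)
trainer.train()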

Quick Start & Requirements

The tutorial's labs are primarily run through the provided Google Colab notebooks, which give access to the necessary computational resources and pretrained models. Key prerequisites are familiarity with Python and basic LLM concepts. Small datasets, such as sequences_hg38_200b_verysmall.csv.gz, are included in the repository, and links to Hugging Face models such as RaphaelMourad/Mistral-DNA-v1-17M-hg38 and RaphaelMourad/Mistral-DNA-v1-138M-yeast are provided for direct access.
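For users who prefer running outside Colab, a minimal quick-start sketch is to load one of the linked checkpoints and embed a DNA sequence. Mean-pooling the last hidden state and the trust_remote_code flag are assumptions here, not necessarily what the tutorial notebooks do.

import torch
from transformers import AutoTokenizer, AutoModel

model_name = "RaphaelMourad/Mistral-DNA-v1-17M-hg38"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True).eval()

dna = "ACGTACGTGGCTAGCTAGGATCGATCGTAGCTAGCTA"  # toy sequence
inputs = tokenizer(dna, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, seq_len, hidden_size)
embedding = hidden.mean(dim=1)                  # mean-pool into one sequence vector
print(embedding.shape)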

Highlighted Details

  • Demonstrates LLM pretraining on DNA sequences using a simplified Mistral architecture.
  • Covers DNA sequence classification tasks including transcription factor binding and histone mark presence.
  • Explains zero-shot learning for predicting mutation effects via sequence embedding comparisons (see the sketch after this list).
  • Includes practical examples for synthetic DNA sequence generation and optimization.
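
Below is a minimal sketch of the zero-shot mutation-effect idea referenced above: embed a reference and a mutated sequence, then use their embedding distance as the effect score. Mean-pooled embeddings and cosine distance are illustrative choices, not necessarily those used in the lab notebooks.

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

model_name = "RaphaelMourad/Mistral-DNA-v1-17M-hg38"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True).eval()

def embed(seq: str) -> torch.Tensor:
    """Mean-pool the last hidden state into a single sequence embedding."""
    inputs = tokenizer(seq, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state
    return hidden.mean(dim=1).squeeze(0)

ref = "ACGTTGCATGTCGCATGATGCATGAGAGCT"
mut = "ACGTTGCATGTCGCATGATGAATGAGAGCT"  # single-nucleotide change (C -> A)

# A larger distance between embeddings is read as a larger predicted mutation effect.
effect = 1.0 - F.cosine_similarity(embed(ref), embed(mut), dim=0).item()
print(f"predicted mutation effect (cosine distance): {effect:.4f}")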

Maintenance & Community

Information regarding maintenance, community channels (like Discord/Slack), or active contributors is not detailed in the provided README.

Licensing & Compatibility

The repository's README does not specify a software license. Users should verify licensing terms for any underlying models or datasets used, particularly for commercial applications.

Limitations & Caveats

The dataset used for pretraining is explicitly labeled "verysmall," suggesting it may not be sufficient for training robust, general-purpose genomic LLMs from scratch. The tutorial's execution relies heavily on Google Colab, which may present limitations for users requiring extensive local control or offline processing capabilities.

Health Check

  • Last Commit: 5 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

6 stars in the last 30 days
