RNA-FM  by ml4bio

RNA foundation model for RNA sequence analysis and design

created 3 years ago
289 stars

Top 91.9% on sourcepulse

Project Summary

RNA-FM is a foundation model for RNA sequences, offering general-purpose embeddings for diverse downstream tasks like structure prediction and functional analysis. It forms the core of an integrated ecosystem including RhoFold (sequence-to-structure), RiboDiffusion, and RhoDesign (structure-to-sequence design), targeting researchers in RNA therapeutics, synthetic biology, and fundamental RNA biology.

How It Works

RNA-FM is a BERT-style transformer encoder pre-trained on over 23 million non-coding RNA sequences using a masked language model objective. This self-supervised approach extracts rich structural and functional information without labeled data, generating 640-dimensional embeddings. The extended ecosystem leverages these embeddings: RhoFold uses them with a geometry module for accurate tertiary structure prediction, while RiboDiffusion (a diffusion model) and RhoDesign (a GVP+Transformer model) employ them for advanced RNA inverse folding and design.
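
To make the masked language model objective concrete, here is a minimal sketch of how one training example could be built from a raw sequence. It is an illustration only: the 15% mask rate and the `<mask>` symbol are assumptions borrowed from common BERT practice, not values confirmed for RNA-FM, and the function name `mask_sequence` is hypothetical. BERT's usual mix of random and unchanged replacements is left out for brevity.

```python
import random

MASK_TOKEN = "<mask>"  # hypothetical mask symbol; RNA-FM's vocabulary may differ


def mask_sequence(seq, mask_rate=0.15, seed=None):
    """Return (corrupted_tokens, targets) for one masked-LM training example.

    targets maps each masked position to the original nucleotide that the
    model would be asked to recover from the surrounding context.
    """
    rng = random.Random(seed)
    tokens = list(seq)
    # Mask at least one position (when possible) so the example carries signal.
    n_mask = min(len(tokens), max(1, round(len(tokens) * mask_rate)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]
        tokens[pos] = MASK_TOKEN
    return tokens, targets


tokens, targets = mask_sequence("GGGUGCGAUCAUACCAGCACUAAUGCCC", seed=0)
print(" ".join(tokens))   # corrupted input fed to the encoder
print(targets)            # labels: position -> original nucleotide
```

Because the labels come from the sequence itself, no annotation is needed, which is what lets pre-training scale to millions of unlabeled sequences.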

Quick Start & Requirements

  • Install: Clone the repository, create a Conda environment (conda env create -f environment.yml), activate it (conda activate RNA-FM), and download pre-trained models.
  • Dependencies: Python, PyTorch, Conda. GPU recommended for performance.
  • Usage: Run python launch/predict.py to generate embeddings or predict secondary structure.
  • Docs: RNA-FM Overview, Online Server

Highlighted Details

  • RNA-FM outperforms other single-sequence RNA language models on structure and function benchmarks.
  • RhoFold achieves state-of-the-art accuracy in RNA tertiary structure prediction, predicting structures in seconds.
  • RiboDiffusion offers tunable sequence diversity in inverse folding, improving sequence recovery by 11-16%.
  • RhoDesign provides deterministic inverse folding with >50% sequence recovery, nearly double that of traditional methods.
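
The sequence recovery figures above are not defined on this page. A minimal sketch, assuming the usual inverse-folding definition (the fraction of positions where the designed sequence matches the native one), might look like this. The function name and the example sequences are made up for illustration.

```python
def sequence_recovery(native, designed):
    """Fraction of aligned positions where designed matches native."""
    if len(native) != len(designed):
        raise ValueError("sequences must have equal length")
    if not native:
        raise ValueError("sequences must be non-empty")
    # Compare case-insensitively, position by position.
    matches = sum(a == b for a, b in zip(native.upper(), designed.upper()))
    return matches / len(native)


# Hypothetical native sequence and a design that differs at two positions.
native = "GGGAAACUUCGGUUUCCC"
designed = "GGGAAGCUUCGGCUUCCC"
print(f"recovery = {sequence_recovery(native, designed):.1%}")
```

Under this definition, a recovery above 50% means more than half of the designed nucleotides agree with the native sequence for the target structure.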

Maintenance & Community

The project is actively developed by ml4bio, with associated repositories for RhoFold, RiboDiffusion, and RhoDesign. Community support is available via GitHub Issues.

Licensing & Compatibility

The source code is released under the MIT license, permitting commercial use and integration into closed-source projects.

Limitations & Caveats

The README points to a separate server for RhoFold, which suggests that local tertiary structure prediction may need additional setup or dependencies not covered by the main RNA-FM installation. Separately, the mRNA-FM variant requires input sequences to be codon-aligned (length divisible by 3).
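
The codon-alignment requirement can be checked before a sequence is passed to the model. The helper below is a minimal sketch; `split_codons` is a hypothetical name and is not part of the RNA-FM code base.

```python
def split_codons(seq):
    """Split a coding sequence into codons; reject lengths not divisible by 3."""
    seq = seq.strip().upper()
    if len(seq) % 3 != 0:
        raise ValueError(
            f"sequence length {len(seq)} is not divisible by 3 (not codon-aligned)"
        )
    # Step through the sequence three nucleotides at a time.
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


print(split_codons("AUGGCCUAA"))
```

Failing early with a clear error is usually easier to debug than a shape or tokenization error raised deeper inside the model.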

Health Check

  • Last commit: 2 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 29 stars in the last 90 days
