RNA-FM by ml4bio

RNA foundation model for RNA sequence analysis and design

Created 3 years ago
303 stars

Top 88.2% on SourcePulse

Project Summary

RNA-FM is a foundation model for RNA sequences, offering general-purpose embeddings for diverse downstream tasks like structure prediction and functional analysis. It forms the core of an integrated ecosystem including RhoFold (sequence-to-structure), RiboDiffusion, and RhoDesign (structure-to-sequence design), targeting researchers in RNA therapeutics, synthetic biology, and fundamental RNA biology.

How It Works

RNA-FM is a BERT-style transformer encoder pre-trained on over 23 million non-coding RNA sequences using a masked language model objective. This self-supervised approach extracts rich structural and functional information without labeled data, generating 640-dimensional embeddings. The extended ecosystem leverages these embeddings: RhoFold uses them with a geometry module for accurate tertiary structure prediction, while RiboDiffusion (a diffusion model) and RhoDesign (a GVP+Transformer model) employ them for advanced RNA inverse folding and design.
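To make the masked language model objective concrete, here is a minimal Python sketch of the masking step only. The 15% mask rate, the "<mask>" token name, and the helper mask_sequence are illustrative assumptions, not details confirmed by the RNA-FM source; the real tokenizer and masking scheme may differ.

```python
import random

# Assumed placeholder token; the real model's vocabulary may use a different one.
MASK = "<mask>"


def mask_sequence(seq, mask_rate=0.15, seed=0):
    """Hide a fraction of nucleotides and record what was hidden.

    Returns (tokens, targets): tokens is the sequence as a list with some
    positions replaced by MASK, and targets maps each masked position to the
    original nucleotide a model would be trained to recover.
    """
    rng = random.Random(seed)  # seeded so the example is reproducible
    tokens = list(seq)
    # Mask at least one position for non-empty input, never more than exist.
    n_mask = min(len(tokens), max(1, round(len(tokens) * mask_rate)))
    targets = {}
    for pos in rng.sample(range(len(tokens)), n_mask):
        targets[pos] = tokens[pos]
        tokens[pos] = MASK
    return tokens, targets


if __name__ == "__main__":
    tokens, targets = mask_sequence("GGGUGCGAUCAUACCAGCAC")
    print(tokens)
    print(targets)
```

During pre-training, the encoder would see the masked tokens and be scored only on the positions listed in targets, which is why no labels beyond the sequences themselves are needed.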

Quick Start & Requirements

  • Install: Clone the repository, create a Conda environment (conda env create -f environment.yml), activate it (conda activate RNA-FM), and download pre-trained models.
  • Dependencies: Python, PyTorch, Conda. GPU recommended for performance.
  • Usage: Run python launch/predict.py to generate embeddings or predict secondary structure.
  • Docs: RNA-FM Overview, Online Server

Highlighted Details

  • RNA-FM outperforms other single-sequence RNA language models on structure and function benchmarks.
  • RhoFold achieves state-of-the-art accuracy in RNA tertiary structure prediction, predicting structures in seconds.
  • RiboDiffusion offers tunable sequence diversity in inverse folding, improving sequence recovery by 11-16%.
  • RhoDesign provides deterministic inverse folding with >50% sequence recovery, nearly double that of traditional methods.

Maintenance & Community

The project is actively developed by ml4bio, with associated repositories for RhoFold, RiboDiffusion, and RhoDesign. Community support is available via GitHub Issues.

Licensing & Compatibility

The source code is released under the MIT license, permitting commercial use and integration into closed-source projects.

Limitations & Caveats

The README points to a separate server for RhoFold, which suggests that local tertiary structure prediction may need additional setup or dependencies not covered by the main RNA-FM installation. The mRNA-FM variant requires codon-aligned input: each sequence length must be divisible by 3.
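The codon-alignment requirement can be checked before sequences are passed to the model. The sketch below is a hypothetical pre-check; is_codon_aligned and split_codons are illustrative names and are not part of the RNA-FM codebase.

```python
def is_codon_aligned(seq):
    """True if the sequence is non-empty and its length is a multiple of 3."""
    return len(seq) > 0 and len(seq) % 3 == 0


def split_codons(seq):
    """Split a codon-aligned sequence into consecutive 3-nucleotide codons."""
    if not is_codon_aligned(seq):
        raise ValueError(
            f"sequence length {len(seq)} is not a positive multiple of 3"
        )
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


if __name__ == "__main__":
    print(split_codons("AUGGCCUAA"))   # ['AUG', 'GCC', 'UAA']
    print(is_codon_aligned("AUGGC"))   # False
```

Rejecting misaligned input up front gives a clearer error than a failure deep inside tokenization.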

Health Check

  • Last commit: 3 months ago
  • Responsiveness: 1+ week
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 10 stars in the last 30 days
