LucaOne by LucaOne

Foundation model for biological sequences

Created 1 year ago
299 stars

Top 89.0% on SourcePulse

Project Summary

LucaOne is a general-purpose biological foundation model that processes both nucleic acid (DNA/RNA) and protein sequences. It gives researchers and developers tools for embedding inference and for adapting the model to downstream bioinformatics tasks.

How It Works

LucaOne uses a single language-model architecture trained on a large dataset of both nucleic acid and protein sequences. Training on both lets the model learn a shared representation space, which supports zero-shot and few-shot learning across the two modalities and helps it capture relationships such as the central dogma (DNA to RNA to protein).
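To make the idea of one model reading two kinds of sequence more concrete, here is a minimal sketch of a shared vocabulary with a per-token molecule-type id. It is an illustration only, not LucaOne's actual tokenizer: the names `VOCAB`, `TYPE_IDS`, and `encode`, the special tokens, and the `nuc:`/`aa:` prefixes are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical alphabets and special tokens; not taken from the LucaOne code base.
NUCLEOTIDES = "ACGTUN"
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
SPECIAL = ["[PAD]", "[UNK]", "[CLS]", "[SEP]"]


def build_vocab() -> Dict[str, int]:
    """Build one vocabulary that holds both nucleotide and amino acid tokens."""
    vocab = {tok: i for i, tok in enumerate(SPECIAL)}
    for ch in NUCLEOTIDES:
        vocab["nuc:" + ch] = len(vocab)
    for ch in AMINO_ACIDS:
        vocab["aa:" + ch] = len(vocab)
    return vocab


VOCAB = build_vocab()
TYPE_IDS = {"nucleic": 0, "protein": 1}  # assumed molecule-type labels


def encode(seq: str, seq_type: str) -> Tuple[List[int], List[int]]:
    """Return (token ids, type ids) for a sequence of the given molecule type."""
    if seq_type not in TYPE_IDS:
        raise ValueError(f"unknown sequence type: {seq_type!r}")
    prefix = "nuc:" if seq_type == "nucleic" else "aa:"
    unk = VOCAB["[UNK]"]
    ids = [VOCAB["[CLS]"]]
    # Letters outside the chosen alphabet fall back to [UNK].
    ids += [VOCAB.get(prefix + ch, unk) for ch in seq.upper()]
    ids.append(VOCAB["[SEP]"])
    type_ids = [TYPE_IDS[seq_type]] * len(ids)
    return ids, type_ids
```

In this sketch the letters `ACGT` map to different token ids depending on whether they are read as nucleotides or amino acids, while both kinds of sequence share one id space that a single model could embed.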

Quick Start & Requirements

  • Installation: Requires Python 3.9.13 (Anaconda recommended); install dependencies with pip install -r requirements.txt.
  • Data: Pre-training dataset available via FTP (CNGB Sequence Archive accession CNP0007266).
  • Checkpoints: Downloadable from an FTP server.
  • Resources: Significant storage for the pre-training dataset is required.
  • Links: LucaOneApp (GitHub), LucaOneTasks (GitHub), PreTrainingDataset

Highlighted Details

  • Unified model for nucleic acid and protein sequences.
  • Supports multiple pre-training tasks and downstream fine-tuning.
  • Offers embedding inference capabilities.
  • Achieved top rankings in bioinformatics publications and competitions.
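As a rough illustration of how embedding output is often used downstream, the sketch below collapses per-token vectors into one fixed-size vector per sequence with masked mean pooling. It assumes the model returns one vector per token, which this summary does not state, and the function name `mean_pool` is hypothetical rather than part of any LucaOne API.

```python
from typing import List, Sequence


def mean_pool(token_vectors: Sequence[Sequence[float]], mask: Sequence[int]) -> List[float]:
    """Average the token vectors whose mask entry is non-zero (for example, non-padding tokens)."""
    if len(token_vectors) != len(mask):
        raise ValueError("token_vectors and mask must have the same length")
    kept = [vec for vec, keep in zip(token_vectors, mask) if keep]
    if not kept:
        raise ValueError("mask excludes every token")
    dim = len(kept[0])
    # Average each embedding dimension over the kept tokens.
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]
```

A pooled vector of this kind can then serve as the input feature for a small downstream classifier or regressor.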

Maintenance & Community

The project is associated with Alibaba Cloud and Tongyi Lab and lists a team of named contributors. The README does not mention any community channels.

Licensing & Compatibility

The README does not explicitly state a license. The project is publicly available on Zenodo and GitHub, which suggests it is intended for open use, but the terms for reuse and commercial use are not detailed.

Limitations & Caveats

The pre-training dataset is substantial in size and only available via FTP, which may pose accessibility challenges. The project appears to be actively developed with checkpoints updated frequently, indicating potential for breaking changes.

Health Check

  • Last Commit: 3 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 2
  • Issues (30d): 0
  • Star History: 8 stars in the last 30 days
