marin by marin-community

Framework for reproducible foundation model research and development

Created 1 year ago
441 stars

Top 67.9% on SourcePulse

Project Summary

Marin is an open-source framework for reproducible research and development of foundation models, particularly large language models. It is aimed at researchers and engineers and provides a transparent, auditable workflow that tracks every experimental step from raw data to final model, including failed attempts.

How It Works

Marin structures experiments as a directed acyclic graph (DAG) of steps, similar to a Makefile. Each step represents a distinct operation (e.g., data curation, tokenization, training, evaluation) and can depend on the output of previous steps. This dependency management ensures a reproducible and auditable workflow, allowing users to trace the lineage of their models and understand the impact of each experimental decision.
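
The Makefile analogy can be illustrated with a minimal Python sketch. It does not use Marin's API: the `Step` class, the `run_pipeline` function, and the toy step functions are hypothetical, and they only show how steps that consume earlier outputs can be run in dependency order with a built-in cycle check. It assumes Python 3.9+ for the standard-library `graphlib` module.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Callable, Dict, List


@dataclass
class Step:
    """Hypothetical step: a name, a function, and the names of the steps it depends on."""
    name: str
    fn: Callable[..., object]
    deps: List[str] = field(default_factory=list)


def run_pipeline(steps: List[Step]) -> Dict[str, object]:
    """Run each step after its dependencies and return all outputs keyed by step name."""
    by_name = {s.name: s for s in steps}
    graph = {s.name: set(s.deps) for s in steps}
    outputs: Dict[str, object] = {}
    # static_order() raises graphlib.CycleError if the graph is not acyclic.
    for name in TopologicalSorter(graph).static_order():
        step = by_name[name]
        # Each step receives the outputs of its dependencies, in declared order.
        outputs[name] = step.fn(*(outputs[d] for d in step.deps))
    return outputs


# Toy stand-ins for data curation, tokenization, training, and evaluation.
steps = [
    Step("curate", lambda: ["a b", "b c"]),
    Step("tokenize", lambda docs: [d.split() for d in docs], deps=["curate"]),
    Step("train", lambda toks: {"vocab": sorted({t for doc in toks for t in doc})}, deps=["tokenize"]),
    Step("evaluate", lambda model: len(model["vocab"]), deps=["train"]),
]
results = run_pipeline(steps)
print(results["evaluate"])  # prints 3
```

Because every output is stored under its step name, the lineage of the final result can be read back by following `deps` from `evaluate` to `curate`.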

Quick Start & Requirements

  • Install: pip install marin
  • Prerequisites: Python 3.8+, PyTorch. GPU and CUDA are recommended for larger models.
  • Demo: The README provides a Python script for training a tiny language model on TinyStories.
  • Documentation: Available on ReadTheDocs or in the docs/ folder.

Highlighted Details

  • Enables end-to-end reproducibility for LLM training pipelines, including data processing, tokenization, and evaluation.
  • Targets large datasets and distributed training across TPUs and multi-node GPUs (multi-node GPU support is described as upcoming; see Limitations & Caveats).
  • Used to train an 8B parameter model that outperforms Llama 3.1 8B.
  • Features a "Datashop" for community data contribution and creation.
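
One common way to make pipeline outputs traceable is to derive each step's identifier from a hash of its configuration and the identifiers of its upstream steps, so that any upstream change produces a new downstream identifier. The sketch below illustrates that general idea only. It is an assumption made for illustration, not a description of how Marin names or stores outputs, and `step_id` and the config values are hypothetical.

```python
import hashlib
import json
from typing import Dict, List


def step_id(name: str, config: Dict[str, object], upstream_ids: List[str]) -> str:
    """Hypothetical helper: a deterministic ID from a step's name, config, and upstream IDs."""
    payload = json.dumps(
        {"name": name, "config": config, "upstream": sorted(upstream_ids)},
        sort_keys=True,  # stable key order so equal configs hash identically
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return f"{name}-{digest[:8]}"


# Same training config, but the tokenizer config differs between runs A and B.
tok_a = step_id("tokenize", {"vocab_size": 32000}, [])
tok_b = step_id("tokenize", {"vocab_size": 50000}, [])
train_a = step_id("train", {"lr": 0.0003}, [tok_a])
train_b = step_id("train", {"lr": 0.0003}, [tok_b])

print(train_a != train_b)  # the upstream change propagates to the training step's ID
```

Under this scheme, rerunning an unchanged pipeline reproduces the same identifiers, while a failed or altered attempt keeps its own distinct identifier.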

Maintenance & Community

  • Active community engagement via Discord.
  • Hosts a "Marin Speedrun" competition for efficient LLM training.

Licensing & Compatibility

  • The README does not explicitly state a license.

Limitations & Caveats

  • The project is primarily focused on LLM training; broader applicability to other foundation model types is not detailed.
  • Multi-node GPU support is described as upcoming; the current primary examples focus on CPU-only or TPU configurations.

Health Check

  • Last commit: 17 hours ago
  • Responsiveness: Inactive
  • Pull requests (30d): 61
  • Issues (30d): 76
  • Star history: 65 stars in the last 30 days
