marin by marin-community

Framework for reproducible foundation model research and development

Created 1 year ago

707 stars

Top 48.4% on SourcePulse


7 Experts Love This Project

Nathan Lambert

Research Scientist at AI2

Lysandre Debut

Chief Open-Source Officer at Hugging Face

John Yang

Coauthor of SWE-bench, SWE-agent

Percy Liang

Cofounder of Together AI; Professor at Stanford

and 3 more!

Project Summary

Marin is an open-source framework for reproducible research and development of foundation models, particularly large language models. It targets researchers and engineers who need a transparent, auditable workflow: every experimental step is tracked, from raw data to final model, including failed attempts.

How It Works

Marin structures experiments as a directed acyclic graph (DAG) of steps, similar to a Makefile. Each step represents a distinct operation (e.g., data curation, tokenization, training, evaluation) and can depend on the output of previous steps. This dependency management ensures a reproducible and auditable workflow, allowing users to trace the lineage of their models and understand the impact of each experimental decision.
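The Makefile-like idea above can be sketched in a few lines of plain Python. This is an illustrative toy, not Marin's actual API: the `Step` class, the `run_pipeline` function, and the four step functions are all hypothetical names chosen for this example. It only shows how declaring dependencies between steps lets a runner execute each step once, in dependency order, and keep every intermediate output for later inspection.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Step:
    """Hypothetical step: a name, a function, and the names of upstream steps."""
    name: str
    fn: Callable[..., object]
    deps: List[str] = field(default_factory=list)


def run_pipeline(steps: List[Step]) -> Dict[str, object]:
    """Run each step exactly once, after its dependencies; return all outputs."""
    by_name = {s.name: s for s in steps}
    outputs: Dict[str, object] = {}
    in_progress = set()

    def run(name: str) -> object:
        if name in outputs:  # already computed, reuse the cached output
            return outputs[name]
        if name in in_progress:  # a step depends on itself, so this is not a DAG
            raise ValueError(f"cycle detected at step {name!r}")
        in_progress.add(name)
        step = by_name[name]
        args = [run(dep) for dep in step.deps]  # resolve upstream steps first
        outputs[name] = step.fn(*args)
        in_progress.discard(name)
        return outputs[name]

    for s in steps:
        run(s.name)
    return outputs


# Toy stand-ins for curation, tokenization, training, and evaluation.
def curate():
    return ["the cat sat", "the dog ran"]


def tokenize(docs):
    return [d.split() for d in docs]


def train(tokens):
    return {"vocab_size": len({t for doc in tokens for t in doc})}


def evaluate(model, tokens):
    return model["vocab_size"] / sum(len(doc) for doc in tokens)


steps = [
    Step("curate", curate),
    Step("tokenize", tokenize, deps=["curate"]),
    Step("train", train, deps=["tokenize"]),
    Step("evaluate", evaluate, deps=["train", "tokenize"]),
]
results = run_pipeline(steps)
```

Because `results` keeps the output of every step, the lineage of the final value can be traced back through `train`, `tokenize`, and `curate`. A real system would typically persist these outputs and skip steps whose inputs have not changed, which this sketch does not attempt.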

Quick Start & Requirements

Install: pip install marin
Prerequisites: Python 3.8+, PyTorch. GPU and CUDA are recommended for larger models.
Demo: The README provides a Python script for training a tiny language model on TinyStories.
Documentation: Available on ReadTheDocs or in the docs/ folder.

Highlighted Details

Enables end-to-end reproducibility for LLM training pipelines, including data processing, tokenization, and evaluation.
Supports scaling to large datasets and distributed training on TPUs; multi-node GPU support is described as upcoming.
Used to train an 8B-parameter model that reportedly outperforms Llama 3.1 8B.
Features a "Datashop" for community data contribution and creation.

Maintenance & Community

Active community engagement via Discord.
Hosts a "Marin Speedrun" competition for efficient LLM training.

Licensing & Compatibility

The README does not explicitly state a license.

Limitations & Caveats

The project is primarily focused on LLM training; broader applicability to other foundation model types is not detailed.
Multi-node GPU support is described as upcoming; the current examples focus on CPU-only or TPU configurations.

Health Check

Last Commit: 9 hours ago
Responsiveness: Inactive

Star History

52 stars in the last 30 days