pico-train by pico-lm

Minimalist framework for language model training and learning dynamics research

Created 1 year ago · 290 stars · Top 90.8% on SourcePulse

View on GitHub
Project Summary

Pico Train is a minimalist, research-focused framework for training language models from 1 million to 1 billion parameters. It supports transparent, reproducible learning dynamics research by saving comprehensive, granular checkpoints that include activations and gradients, and by standardizing data and architecture so that models can be compared directly across scales. Its target audience is researchers and engineers interested in the internal workings and scaling laws of LLMs.

How It Works

Pico Train uses a standardized LLaMA-style architecture (Pico Decoder) built from components such as RMSNorm, RoPE, and SwiGLU. Its core advantage is a "comprehensive checkpointing" strategy that automatically saves not only model and optimizer states but also activations and gradients at regular intervals. Because every scale is trained with identical data, architecture, and optimizer settings, this rich data supports direct comparison of learning dynamics as model size varies, isolating size as the primary variable; the sketch below illustrates that workflow.
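To make the cross-scale workflow concrete, here is a minimal sketch of comparing step-aligned checkpoints from two model sizes. The directory layout, file names, and model names below are illustrative assumptions, not pico-train's documented checkpoint format.

```python
# Hypothetical sketch: checkpoint paths, file layout, and key names are
# assumptions for illustration; pico-train's actual on-disk format may differ.
import torch

def activation_norms(path: str) -> dict[str, float]:
    """Load a saved activations file and compute the mean L2 norm per layer."""
    activations = torch.load(path, map_location="cpu")
    return {
        layer: tensor.float().norm(dim=-1).mean().item()
        for layer, tensor in activations.items()
    }

# Because every model size is trained on identical data with identical
# settings, checkpoints at the same step can be compared directly.
small = activation_norms("checkpoints/pico-decoder-small/step_1000/activations.pt")
large = activation_norms("checkpoints/pico-decoder-large/step_1000/activations.pt")
for layer in sorted(small.keys() & large.keys()):
    print(f"{layer}: small={small[layer]:.3f}  large={large[layer]:.3f}")
```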

Quick Start & Requirements

  • Install by running source setup.sh (creates a Poetry environment and installs dependencies).
  • Requires Hugging Face and Weights & Biases API tokens, configured via a .env file (see the pre-flight sketch after this list).
  • Start training with poetry run train --config_path configs/demo.yaml.
  • A full tutorial is available at picolm.io.
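As a pre-flight check before launching training, the sketch below verifies that the required API tokens are visible to the process. HF_TOKEN and WANDB_API_KEY are the conventional variable names read by huggingface_hub and wandb; whether pico-train's .env file expects exactly these names is an assumption.

```python
# Pre-flight check: confirm API tokens from .env are set before training.
# HF_TOKEN / WANDB_API_KEY are conventional names used by huggingface_hub
# and wandb; pico-train's exact expectations may differ (an assumption).
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads KEY=value pairs from a local .env file

for var in ("HF_TOKEN", "WANDB_API_KEY"):
    if not os.environ.get(var):
        raise SystemExit(f"{var} is not set; add it to your .env file")

print("Tokens found; run: poetry run train --config_path configs/demo.yaml")
```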

Highlighted Details

  • Optimized for 1M-1B parameter models, a range chosen to keep learning dynamics research tractable.
  • Trains on a pre-tokenized, pre-shuffled Dolma dataset so every run sees identical data.
  • Checkpoints include model state, optimizer state, activations, gradients, and logs.
  • Integrates with the companion pico-analyze library for post-training analysis.

Maintenance & Community

  • Primarily developed by Richard Diehl Martinez.
  • Website: picolm.io.

Licensing & Compatibility

  • Licensed under the Apache License 2.0.
  • Permissive license suitable for commercial use and integration with closed-source projects.

Limitations & Caveats

Currently supports only the Pico Decoder architecture, with additional architectures planned. The framework is geared toward research and may require adaptation for production deployment workflows.

Health Check

  • Last commit: 3 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 2 stars in the last 30 days

Explore Similar Projects

Starred by Théophile Gervet (Cofounder of Genesis AI), Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), and 6 more.

lingua by facebookresearch

Top 0.1% on SourcePulse · 5k stars
LLM research codebase for training and inference
Created 11 months ago · Updated 2 months ago
Starred by François Chollet (Author of Keras; Cofounder of Ndea, ARC Prize) and Omar Sanseviero (DevRel at Google DeepMind).

keras-hub by keras-team

Top 0.6% on SourcePulse · 932 stars
Pretrained model hub for Keras 3
Created 5 years ago · Updated 1 day ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Lewis Tunstall (Research Engineer at Hugging Face), and 13 more.

torchtitan by pytorch

Top 0.7% on SourcePulse · 4k stars
PyTorch platform for generative AI model training research
Created 1 year ago · Updated 21 hours ago