recurrent-pretraining by seal-rg

Pretraining code for depth-recurrent language model research

created 5 months ago
806 stars

Top 44.7% on sourcepulse

View on GitHub
Project Summary

This repository provides the pretraining code for a large-scale depth-recurrent language model, specifically the huginn-0125 model. It is targeted at researchers and engineers interested in replicating or understanding the training process of such models, particularly on large-scale AMD GPU clusters, offering insights into overcoming hardware-specific challenges.

How It Works

The project implements a depth-recurrent transformer architecture, in which a recurrent block is iterated a variable number of times in depth, for large-scale language model pretraining. It leverages a custom parallelism implementation (SimpleFabric) and an _allreduce_chunk_stream method for inter-node communication, specifically addressing RCCL hangs encountered on AMD systems. The training process is orchestrated via train.py, with model definitions in recpre/model_dynamic.py and configurations in launch_configs/.
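
The repository's _allreduce_chunk_stream itself is not reproduced here, but the general technique it names (splitting one large gradient all-reduce into bounded-size chunks so that each collective call stays small) can be sketched with plain torch.distributed. The function name, chunk size, and averaging convention below are illustrative assumptions, not the project's implementation:

    # Sketch of a chunked all-reduce, loosely inspired by the idea behind
    # _allreduce_chunk_stream; NOT the repository's code.
    import torch
    import torch.distributed as dist

    def allreduce_in_chunks(flat_grad: torch.Tensor, chunk_numel: int = 2**24) -> None:
        """Average a flat gradient buffer across ranks in fixed-size chunks.

        Keeping each RCCL/NCCL call bounded in size is one common workaround
        for hangs observed with very large all-reduces on some interconnects
        (assumed motivation, mirroring the summary above).
        """
        world_size = dist.get_world_size()
        total = flat_grad.numel()
        for start in range(0, total, chunk_numel):
            length = min(chunk_numel, total - start)
            chunk = flat_grad.narrow(0, start, length)
            dist.all_reduce(chunk, op=dist.ReduceOp.SUM)
            chunk.div_(world_size)

A real run would call something like this once per optimizer step on a flattened gradient buffer, after torch.distributed has been initialized with the appropriate backend (RCCL appears as the "nccl" backend on ROCm builds of PyTorch).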

Quick Start & Requirements

  • Install/Run: Primarily through Python scripts. The core training command is python train.py --config=launch_configs/your_config.yaml (a minimal sketch of this launch flow follows the list).
  • Prerequisites: A large-scale AMD GPU cluster (4096 GPUs are mentioned), Python with a GPU-enabled PyTorch stack (ROCm on the AMD cluster; the code builds on litgpt), and potentially specific libraries such as bpeasy for tokenizer generation.
  • Setup: Data preparation involves several steps (tokenizer generation, data download at scale, and parquet conversion/shuffling) that can be time-consuming and prone to errors when upstream datasets change.
  • Links: Tech report: https://www.arxiv.org/abs/2502.05171, Model: https://huggingface.co/tomg-group-umd/huginn-0125, Dataset: https://huggingface.co/datasets/tomg-umd/huginn-dataset.
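
As a rough illustration of the launch flow from the Install/Run bullet above, the snippet below loads a YAML launch config and hands it to a training entry point. The config keys and the train() function are hypothetical placeholders; the real options live in launch_configs/ and the real loop in train.py:

    # Hypothetical illustration of the `python train.py --config=...` flow;
    # config keys and train() are placeholders, not the repository's API.
    import argparse
    import yaml  # PyYAML, assumed to be installed

    def train(cfg: dict) -> None:
        # Placeholder: the real training loop lives in the repository's train.py.
        print(f"would train with model={cfg.get('model_name', '?')}, "
              f"max_steps={cfg.get('max_steps', '?')}, gpus={cfg.get('num_gpus', '?')}")

    if __name__ == "__main__":
        parser = argparse.ArgumentParser()
        parser.add_argument("--config", required=True,
                            help="path to a YAML file, e.g. launch_configs/your_config.yaml")
        args = parser.parse_args()
        with open(args.config) as f:
            cfg = yaml.safe_load(f)
        train(cfg)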

Highlighted Details

  • Trained on 4096 AMD GPUs on the Frontier supercomputer.
  • Utilizes _allreduce_chunk_stream for inter-node communication to mitigate RCCL hangs.
  • Provides code for both training and minimal inference (recpre/raven_modeling_minimal.py); a hedged loading example follows this list.
  • Benchmark scores can be reproduced with the lm-eval harness, plus the bigcode evaluation harness for code tasks.
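
For the minimal-inference path noted above, a hedged example of loading the released huginn-0125 checkpoint through Hugging Face transformers is sketched below. trust_remote_code=True is required because the architecture is custom; the num_steps argument, used here to set the test-time recurrence depth, is an assumption based on the model's custom generation interface, so consult the model card for the exact API:

    # Hedged inference sketch for tomg-group-umd/huginn-0125; the num_steps
    # recurrence argument is an assumption, not a confirmed API.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "tomg-group-umd/huginn-0125"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, trust_remote_code=True  # custom architecture
    )
    model.eval()

    inputs = tokenizer("The capital of France is", return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=16, num_steps=32)
    print(tokenizer.decode(out[0], skip_special_tokens=True))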

Maintenance & Community

The project is authored by a team including Jonas Geiping, John Kirchenbauer, and others from the tomg-group-umd at the University of Maryland. The authors encourage users to open issues for questions or details.

Licensing & Compatibility

Released under the Apache-2.0 license. Some code is also licensed under the Lightning AI Apache-2.0 license. This license is permissive and generally compatible with commercial use.

Limitations & Caveats

The README explicitly states that this implementation may not be ideal for users wanting to pretrain their own models, suggesting it's more of a reference. The data preparation scripts are noted as not highly scalable, time-consuming, and susceptible to breaking changes in external datasets.

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 3
  • Issues (30d): 7

Star History

55 stars in the last 90 days

Explore Similar Projects

Starred by Stas Bekman (author of the Machine Learning Engineering Open Book; Research Engineer at Snowflake) and Travis Fischer (founder of Agentic).

lingua by facebookresearch

  • Top 0.1% on sourcepulse
  • 5k stars
  • LLM research codebase for training and inference
  • created 9 months ago, updated 2 weeks ago