recurrent-pretraining by seal-rg

Pretraining code for depth-recurrent language model research

created 5 months ago
806 stars

Top 44.7% on sourcepulse

View on GitHub
Project Summary

This repository provides the pretraining code for a large-scale depth-recurrent language model, specifically the huginn-0125 model. It is targeted at researchers and engineers interested in replicating or understanding the training process of such models, particularly on large-scale AMD GPU clusters, offering insights into overcoming hardware-specific challenges.

How It Works

The project implements a depth-recurrent transformer architecture, in which a recurrent block is iterated a variable number of times in depth, for large-scale language model pretraining. It leverages a custom parallelism implementation (SimpleFabric) and an _allreduce_chunk_stream method for inter-node communication, specifically addressing RCCL hangs encountered on AMD systems. The training process is orchestrated via train.py, with model definitions in recpre/model_dynamic.py and configurations in launch_configs/.
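
The repository's _allreduce_chunk_stream itself is not reproduced here, but the general technique it names (splitting one large gradient all-reduce into bounded-size chunks so that each collective call stays small) can be sketched with plain torch.distributed. The function name, chunk size, and averaging convention below are illustrative assumptions, not the project's implementation:

    # Sketch of a chunked all-reduce, loosely inspired by the idea behind
    # _allreduce_chunk_stream; NOT the repository's code.
    import torch
    import torch.distributed as dist

    def allreduce_in_chunks(flat_grad: torch.Tensor, chunk_numel: int = 2**24) -> None:
        """Average a flat gradient buffer across ranks in fixed-size chunks.

        Keeping each RCCL/NCCL call bounded in size is one common workaround
        for hangs observed with very large all-reduces on some interconnects
        (assumed motivation, mirroring the summary above).
        """
        world_size = dist.get_world_size()
        total = flat_grad.numel()
        for start in range(0, total, chunk_numel):
            length = min(chunk_numel, total - start)
            chunk = flat_grad.narrow(0, start, length)
            dist.all_reduce(chunk, op=dist.ReduceOp.SUM)
            chunk.div_(world_size)

A real run would call something like this once per optimizer step on a flattened gradient buffer, after torch.distributed has been initialized with the appropriate backend (RCCL appears as the "nccl" backend on ROCm builds of PyTorch).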

Quick Start & Requirements

  • Install/Run: Primarily through Python scripts. The core training command is python train.py --config=launch_configs/your_config.yaml (a minimal sketch of this launch flow follows the list).
  • Prerequisites: A large-scale AMD GPU cluster (4096 GPUs are mentioned), Python with a GPU-enabled PyTorch stack (ROCm on the AMD cluster; the code builds on litgpt), and potentially specific libraries such as bpeasy for tokenizer generation.
  • Setup: Data preparation involves several steps (tokenizer generation, data download at scale, and parquet conversion/shuffling) that can be time-consuming and prone to errors when upstream datasets change.
  • Links: Tech report: https://www.arxiv.org/abs/2502.05171, Model: https://huggingface.co/tomg-group-umd/huginn-0125, Dataset: https://huggingface.co/datasets/tomg-umd/huginn-dataset.
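
As a rough illustration of the launch flow from the Install/Run bullet above, the snippet below loads a YAML launch config and hands it to a training entry point. The config keys and the train() function are hypothetical placeholders; the real options live in launch_configs/ and the real loop in train.py:

    # Hypothetical illustration of the `python train.py --config=...` flow;
    # config keys and train() are placeholders, not the repository's API.
    import argparse
    import yaml  # PyYAML, assumed to be installed

    def train(cfg: dict) -> None:
        # Placeholder: the real training loop lives in the repository's train.py.
        print(f"would train with model={cfg.get('model_name', '?')}, "
              f"max_steps={cfg.get('max_steps', '?')}, gpus={cfg.get('num_gpus', '?')}")

    if __name__ == "__main__":
        parser = argparse.ArgumentParser()
        parser.add_argument("--config", required=True,
                            help="path to a YAML file, e.g. launch_configs/your_config.yaml")
        args = parser.parse_args()
        with open(args.config) as f:
            cfg = yaml.safe_load(f)
        train(cfg)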

Highlighted Details

  • Trained on 4096 AMD GPUs on the Frontier supercomputer.
  • Utilizes _allreduce_chunk_stream for inter-node communication to mitigate RCCL hangs.
  • Provides code for both training and minimal inference (recpre/raven_modeling_minimal.py); a hedged loading example follows this list.
  • Benchmark scores can be reproduced with the lm-eval harness, plus the bigcode evaluation harness for code tasks.
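
For the minimal-inference path noted above, a hedged example of loading the released huginn-0125 checkpoint through Hugging Face transformers is sketched below. trust_remote_code=True is required because the architecture is custom; the num_steps argument, used here to set the test-time recurrence depth, is an assumption based on the model's custom generation interface, so consult the model card for the exact API:

    # Hedged inference sketch for tomg-group-umd/huginn-0125; the num_steps
    # recurrence argument is an assumption, not a confirmed API.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "tomg-group-umd/huginn-0125"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, trust_remote_code=True  # custom architecture
    )
    model.eval()

    inputs = tokenizer("The capital of France is", return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=16, num_steps=32)
    print(tokenizer.decode(out[0], skip_special_tokens=True))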

Maintenance & Community

The project is authored by a team including Jonas Geiping, John Kirchenbauer, and others from the tomg-group-umd at the University of Maryland. The authors encourage users to open issues for questions or details.

Licensing & Compatibility

Released under the Apache-2.0 license. Some code is also licensed under the Lightning AI Apache-2.0 license. This license is permissive and generally compatible with commercial use.

Limitations & Caveats

The README explicitly states that this implementation may not be ideal for users wanting to pretrain their own models, suggesting it's more of a reference. The data preparation scripts are noted as not highly scalable, time-consuming, and susceptible to breaking changes in external datasets.

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 3
  • Issues (30d): 7

Star History

55 stars in the last 90 days

Explore Similar Projects

Starred by Stas Bekman (author of the Machine Learning Engineering Open Book; Research Engineer at Snowflake) and Travis Fischer (founder of Agentic).

lingua by facebookresearch

  • Top 0.1% on sourcepulse
  • 5k stars
  • LLM research codebase for training and inference
  • created 9 months ago, updated 2 weeks ago