recurrent-pretraining  by seal-rg

Pretraining code for depth-recurrent language model research

Created 1 year ago
887 stars

Top 40.2% on SourcePulse

Project Summary

This repository provides the pretraining code for a large-scale depth-recurrent language model, specifically the huginn-0125 model. It is targeted at researchers and engineers interested in replicating or understanding the training process of such models, particularly on large-scale AMD GPU clusters, offering insights into overcoming hardware-specific challenges.
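As a rough illustration of what "depth-recurrent" means, the toy sketch below reuses one shared block several times between an input stage and an output stage. It is not the code in this repository: the function names, the scalar "state", and the zero initial state are all assumptions made for illustration, and real models operate on tensors with learned weights.

```python
from typing import Callable


def depth_recurrent_forward(
    x: float,
    prelude: Callable[[float], float],
    core: Callable[[float, float], float],
    coda: Callable[[float], float],
    num_recurrences: int,
    initial_state: float = 0.0,  # hypothetical starting state for this toy
) -> float:
    """Apply one shared `core` block `num_recurrences` times between a prelude and a coda."""
    if num_recurrences < 0:
        raise ValueError("num_recurrences must be non-negative")
    embedded = prelude(x)  # computed once, then fed to every recurrence step
    state = initial_state
    for _ in range(num_recurrences):
        # The same function (i.e. the same "weights") is reused at every step,
        # so effective depth grows without adding parameters.
        state = core(state, embedded)
    return coda(state)


# Toy stand-ins for learned blocks (assumptions, chosen only to be easy to follow).
def toy_prelude(x: float) -> float:
    return 2.0 * x


def toy_core(state: float, embedded: float) -> float:
    return 0.5 * state + embedded


def toy_coda(state: float) -> float:
    return state + 1.0


shallow = depth_recurrent_forward(1.0, toy_prelude, toy_core, toy_coda, num_recurrences=1)
deep = depth_recurrent_forward(1.0, toy_prelude, toy_core, toy_coda, num_recurrences=3)
```

The point of the sketch is only that the number of recurrences is a runtime argument, so the same parameters can be unrolled to different depths.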

How It Works

The project implements a depth-recurrent architecture for large-scale language model pretraining. It uses a custom parallelism implementation (SimpleFabric) and an _allreduce_chunk_stream method for inter-node communication, which works around RCCL hangs encountered on AMD systems. Training is orchestrated by train.py; the model is defined in recpre/model_dynamic.py, and run configurations live in launch_configs/.
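The general idea of a chunked all-reduce can be sketched without any GPU code. The simulation below is a guess based only on the method's name, not the repository's _allreduce_chunk_stream: it splits each rank's flat buffer into fixed-size chunks and averages one chunk at a time, so that each simulated collective covers a bounded slice of the buffer. The helper names, the list-of-lists "ranks", and the choice of averaging are assumptions for illustration.

```python
from typing import List, Tuple


def chunk_bounds(numel: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs that cover range(numel) in chunks."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, numel))
        for start in range(0, numel, chunk_size)
    ]


def simulated_chunked_allreduce(
    rank_buffers: List[List[float]], chunk_size: int
) -> List[List[float]]:
    """Average equal-length per-rank buffers chunk by chunk, in place.

    Each loop iteration stands in for one small collective call; a real
    implementation would issue it through a communication library instead.
    """
    world_size = len(rank_buffers)
    numel = len(rank_buffers[0])
    if any(len(buf) != numel for buf in rank_buffers):
        raise ValueError("all rank buffers must have the same length")
    for start, end in chunk_bounds(numel, chunk_size):
        # Reduce only this slice across all simulated ranks.
        reduced = [
            sum(buf[i] for buf in rank_buffers) / world_size
            for i in range(start, end)
        ]
        # Write the reduced slice back so every rank ends with identical values.
        for buf in rank_buffers:
            buf[start:end] = reduced
    return rank_buffers


# Two simulated ranks holding different "gradients".
buffers = [[1.0, 2.0, 3.0, 4.0, 5.0], [3.0, 4.0, 5.0, 6.0, 7.0]]
simulated_chunked_allreduce(buffers, chunk_size=2)
```

The result matches a single all-reduce over the whole buffer; chunking changes only how the work is split into calls, which is the property a workaround for hanging collectives would rely on.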

Quick Start & Requirements

  • Install/Run: Primarily through Python scripts. The core training command is python train.py --config=launch_configs/your_config.yaml.
  • Prerequisites: A large-scale AMD GPU cluster (the original run used 4096 GPUs), a matching GPU software stack (the RCCL references imply ROCm rather than CUDA, although the code builds on litgpt), Python, and libraries such as bpeasy for tokenizer generation.
  • Setup: Data preparation involves several steps: tokenizer generation, scalable data download, and parquet conversion/shuffling. These steps are time-consuming and can break when the upstream datasets change.
  • Links: Tech report: https://www.arxiv.org/abs/2502.05171, Model: https://huggingface.co/tomg-group-umd/huginn-0125, Dataset: https://huggingface.co/datasets/tomg-umd/huginn-dataset.

Highlighted Details

  • Trained on 4096 AMD GPUs on the Frontier supercomputer.
  • Utilizes _allreduce_chunk_stream for inter-node communication to mitigate RCCL hangs.
  • Provides code for both training and minimal inference (recpre/raven_modeling_minimal.py).
  • Benchmark scores can be reproduced using lm-eval harness and bigcode for code tasks.

Maintenance & Community

The project is authored by a team including Jonas Geiping, John Kirchenbauer, and others from the TomG group at UMD. The authors encourage users to open issues for questions or details.

Licensing & Compatibility

Released under the Apache-2.0 license. Some code additionally carries Lightning AI's Apache-2.0 license notice. Apache-2.0 is a permissive license and generally allows commercial use.

Limitations & Caveats

The README explicitly states that this implementation may not be ideal for users who want to pretrain their own models, which suggests it is best treated as a reference. The data preparation scripts are described as not highly scalable, time-consuming, and susceptible to breaking changes in external datasets.

Health Check

  • Last commit: 4 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 1
  • Star history: 13 stars in the last 30 days
