recurrent-pretraining by seal-rg

Pretraining code for depth-recurrent language model research

Created 7 months ago
827 stars

Top 42.9% on SourcePulse

View on GitHub
Project Summary

This repository provides the pretraining code for a large-scale depth-recurrent language model, specifically the huginn-0125 model. It is targeted at researchers and engineers interested in replicating or understanding the training process of such models, particularly on large-scale AMD GPU clusters, offering insights into overcoming hardware-specific challenges.

How It Works

The project implements a depth-recurrent transformer architecture, in which a core block of layers is iterated a variable number of times, for large-scale language model pretraining. It leverages a custom parallelism implementation (SimpleFabric) and an _allreduce_chunk_stream method for inter-node communication, specifically to work around RCCL hangs observed on AMD systems. Training is orchestrated via train.py, with model definitions in recpre/model_dynamic.py and launch configurations in launch_configs/.
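To illustrate the communication pattern, the sketch below shows the general idea of a chunked all-reduce: splitting a flat gradient buffer into fixed-size pieces and reducing them one at a time, so that no single collective has to move the full buffer. This is a hypothetical simplification, not the repository's _allreduce_chunk_stream, which presumably also manages streams and overlap.

```python
# Minimal sketch (not the repo's implementation) of a chunked all-reduce.
# Assumes torch.distributed is already initialized (e.g. via torchrun) and
# that `flat_grads` lives on the local GPU.
import torch
import torch.distributed as dist


def allreduce_in_chunks(flat_grads: torch.Tensor, chunk_numel: int = 2**24) -> None:
    """Average `flat_grads` across ranks, `chunk_numel` elements at a time."""
    for start in range(0, flat_grads.numel(), chunk_numel):
        chunk = flat_grads[start:start + chunk_numel]  # a view, so the reduction is in place
        dist.all_reduce(chunk, op=dist.ReduceOp.SUM)
    flat_grads.div_(dist.get_world_size())  # convert the summed gradients to a mean
```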

Quick Start & Requirements

  • Install/Run: Primarily through Python scripts. The core training command is python train.py --config=launch_configs/your_config.yaml.
  • Prerequisites: A large-scale GPU cluster for full reproduction (the original run used 4096 AMD GPUs), a PyTorch stack derived from litgpt (ROCm on AMD or CUDA on NVIDIA), Python, and additional libraries such as bpeasy for tokenizer generation.
  • Setup: Data preparation involves multiple steps: tokenizer generation, scalable data download, and parquet conversion/shuffling, which can be time-consuming and prone to errors with dataset updates.
  • Links: Tech report: https://arxiv.org/abs/2502.05171; Model: https://huggingface.co/tomg-group-umd/huginn-0125 (loading sketch below); Dataset: https://huggingface.co/datasets/tomg-umd/huginn-dataset.
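Independently of the pretraining pipeline, the released checkpoint can be loaded from the Hugging Face Hub for a quick sanity check. The snippet below is a minimal sketch assuming the standard transformers loading path with the model's custom remote code enabled; the prompt and generation settings are placeholders, and options such as the test-time recurrence depth are documented on the model card.

```python
# Hedged example: loading the released huginn-0125 checkpoint for a quick
# generation test. Assumes a transformers version that supports the model's
# custom remote code; generation arguments below are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tomg-group-umd/huginn-0125"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # the checkpoint ships its own modeling code
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("The capital of Westphalia is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```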

Highlighted Details

  • Trained on 4096 AMD GPUs on the Frontier supercomputer.
  • Utilizes _allreduce_chunk_stream for inter-node communication to mitigate RCCL hangs.
  • Provides code for both training and minimal inference (recpre/raven_modeling_minimal.py).
  • Benchmark scores can be reproduced with the lm-eval harness (plus the BigCode evaluation harness for code tasks); see the sketch below.
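For the benchmark reproduction mentioned above, the following is a heavily hedged sketch using the lm-eval harness Python API; the task list, batch size, and model arguments are placeholders, and the exact evaluation settings are those given in the tech report.

```python
# Hypothetical sketch of reproducing scores with the lm-eval harness
# (v0.4-style Python API). Tasks and settings below are placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=tomg-group-umd/huginn-0125,trust_remote_code=True,dtype=bfloat16",
    tasks=["hellaswag", "arc_challenge"],  # placeholder task list
    batch_size=8,
)
print(results["results"])
```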

Maintenance & Community

The project is authored by a team including Jonas Geiping, John Kirchenbauer, and others from the TomG group at UMD. The authors encourage users to open issues for questions or details.

Licensing & Compatibility

Released under the Apache-2.0 license; portions derived from Lightning AI code (litgpt) carry their own Apache-2.0 notice. The license is permissive and generally compatible with commercial use.

Limitations & Caveats

The README explicitly states that this implementation may not be ideal for users who want to pretrain their own models, positioning it more as a reference. The data preparation scripts are described as limited in scalability, time-consuming to run, and susceptible to breaking changes in upstream datasets.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 13 stars in the last 30 days

Explore Similar Projects

Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI).

dots.llm1 by rednote-hilab

0.2%
462
MoE model for research
Created 4 months ago
Updated 4 weeks ago
Starred by Ying Sheng (Coauthor of SGLang) and Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake).

llm-analysis by cli99

0.4%
455
CLI tool for LLM latency/memory analysis during training/inference
Created 2 years ago
Updated 5 months ago
Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 8 more.

EAGLE by SafeAILab

10.6%
2k
Speculative decoding research paper for faster LLM inference
Created 1 year ago
Updated 1 week ago