seal-rg: Pretraining code for depth-recurrent language model research
Top 42.4% on SourcePulse
This repository provides the pretraining code for a large-scale depth-recurrent language model, specifically the huginn-0125 model. It is aimed at researchers and engineers who want to replicate or understand how such models are trained, particularly on large-scale AMD GPU clusters, and it documents how hardware-specific challenges were overcome.
How It Works
The project implements a depth-recurrent architecture for large-scale language model pretraining. It uses a custom parallelism implementation (SimpleFabric) and an _allreduce_chunk_stream method for inter-node communication, specifically to work around RCCL hangs encountered on AMD systems. Training is orchestrated via train.py, with the model defined in recpre/model_dynamic.py and run configurations in launch_configs/.
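The repository's own code should be consulted for details, but the underlying idea of a chunked all-reduce can be sketched roughly as follows; the function name, chunk size, and plain torch.distributed calls are illustrative assumptions, not the actual _allreduce_chunk_stream implementation.

```python
# Illustrative sketch only: reducing a large gradient buffer in smaller chunks,
# a pattern that can sidestep collective hangs seen with RCCL on big AMD clusters.
# The helper name and default chunk size are assumptions, not the repo's code.
import torch
import torch.distributed as dist

def allreduce_in_chunks(flat_grads: torch.Tensor, chunk_numel: int = 2**24) -> None:
    """All-reduce a flattened gradient buffer chunk by chunk, then average in place."""
    world_size = dist.get_world_size()
    for start in range(0, flat_grads.numel(), chunk_numel):
        chunk = flat_grads[start:start + chunk_numel]
        dist.all_reduce(chunk, op=dist.ReduceOp.SUM)  # one smaller collective per chunk
    flat_grads.div_(world_size)  # turn the summed gradients into an average
```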
Quick Start & Requirements
Launch training with python train.py --config=launch_configs/your_config.yaml. Requires PyTorch (litgpt base), Python, and potentially specific libraries like bpeasy for tokenizer generation.
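As a rough, hypothetical illustration of how a config-driven entry point of this shape typically works (the actual train.py and config schema in the repo will differ):

```python
# Hypothetical launcher in the style of
# `python train.py --config=launch_configs/your_config.yaml`.
# The flag name matches the command above, but the config keys are assumptions.
import argparse
import yaml

def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", required=True, help="path to a launch_configs/*.yaml file")
    args = parser.parse_args()
    with open(args.config) as f:
        cfg = yaml.safe_load(f)  # e.g. model shape, batch size, parallelism settings
    print(f"loaded config with keys: {sorted(cfg)}")

if __name__ == "__main__":
    main()
```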
Highlighted Details
- Custom _allreduce_chunk_stream for inter-node communication to mitigate RCCL hangs.
- A minimal standalone model implementation is provided (recpre/raven_modeling_minimal.py).
- Evaluation uses the lm-eval harness and bigcode for code tasks (see the sketch below).
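For the evaluation point above, a hedged sketch of running the lm-eval harness against the released checkpoint might look like the following; the Hugging Face model ID (tomg-group-umd/huginn-0125) and task list are assumptions, and the repo's own evaluation scripts may wrap this differently.

```python
# Illustrative lm-evaluation-harness call against the released checkpoint.
# Model ID, dtype, and tasks are assumptions; adjust to the actual release.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=tomg-group-umd/huginn-0125,trust_remote_code=True,dtype=bfloat16",
    tasks=["hellaswag", "arc_easy"],
    batch_size=8,
)
print(results["results"])
```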
Maintenance & Community
The project is authored by a team including Jonas Geiping, John Kirchenbauer, and others from the TomG group at UMD. The authors encourage users to open issues for questions or details.
Licensing & Compatibility
Released under the Apache-2.0 license. Some code derives from Lightning AI and is likewise covered by its Apache-2.0 license. The license is permissive and generally compatible with commercial use.
Limitations & Caveats
The README explicitly states that this implementation may not be ideal for users who want to pretrain their own models, suggesting it serves more as a reference. The data preparation scripts are noted as not highly scalable, slow to run, and susceptible to breaking changes in the external datasets they pull from.