Toolkit for neural architecture research and scaling
ArchScale is a comprehensive toolkit for neural language model research, focusing on architecture exploration and scaling laws. It offers implementations of various state-of-the-art architectures, scaling techniques, and training optimizations, targeting researchers and engineers seeking to efficiently train and evaluate large language models.
How It Works
ArchScale employs a unified codebase built on PyTorch Lightning Fabric for distributed training with FSDP, mixed precision, and tensor parallelism. Its core innovation lies in its "WYSIWYG" (What You See Is What You Get) philosophy for experiments, enabling easy modification and addition of architectures, scaling laws, optimizers, and initialization strategies. The framework supports advanced techniques like μP++ for efficient hyperparameter sweeping and scaling, variable-length training for long contexts, and fused kernels for optimized performance.
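As a rough illustration of this training setup, the sketch below shows generic Lightning Fabric usage with FSDP sharding and bf16 mixed precision. It is not ArchScale's actual entry point; the model, device count, and hyperparameters are placeholders chosen for brevity.

```python
# Minimal sketch (generic Lightning Fabric, not ArchScale's code): distributed
# training with FSDP sharding and bf16 mixed precision.
import torch
import torch.nn as nn
import lightning as L
from lightning.fabric.strategies import FSDPStrategy

# Toy stand-in for a transformer language model (placeholder, illustration only).
model = nn.Sequential(nn.Embedding(32000, 512), nn.Linear(512, 32000))

fabric = L.Fabric(
    accelerator="cuda",
    devices=8,                    # GPUs per node (placeholder)
    strategy=FSDPStrategy(),      # fully sharded data parallelism
    precision="bf16-mixed",       # mixed-precision training
)
fabric.launch()

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
model, optimizer = fabric.setup(model, optimizer)

# Dummy batch of token ids; a real run would iterate a packed dataset instead.
tokens = torch.randint(0, 32000, (4, 128), device=fabric.device)
logits = model(tokens)
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, 32000), tokens[:, 1:].reshape(-1)
)
fabric.backward(loss)             # Fabric routes the backward pass through FSDP
optimizer.step()
```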
Quick Start & Requirements
Requires PyTorch and the project's dependencies, with optional components for specific workflows (varlen_mamba for Mamba-based models, vLLM for reasoning evaluation).
Highlighted Details
Training optimizations include torch.compile, FSDP, tensor parallelism, experimental fp8 support, packed datasets, and variable-length training.
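The sketch below illustrates two of these features in generic PyTorch terms: compiling a toy model with torch.compile and packing variable-length documents into one fixed-length training sequence with explicit boundaries. The model, sizes, and boundary handling are placeholders, not ArchScale's API.

```python
# Hedged illustration (generic PyTorch, not ArchScale's code): torch.compile plus
# packing of variable-length documents into a single fixed-size sequence.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Embedding(32000, 256), nn.Linear(256, 32000))
compiled_model = torch.compile(model)   # graph capture / kernel fusion

# Pack several short documents into one 512-token sequence; cumulative boundaries
# are tracked so attention (or recurrent state) can be reset at each document start.
docs = [torch.randint(0, 32000, (n,)) for n in (300, 120, 92)]
packed = torch.cat(docs)                          # shape: (512,)
cu_seqlens = torch.tensor([0, 300, 420, 512])     # cumulative document boundaries

logits = compiled_model(packed.unsqueeze(0))      # shape: (1, 512, 32000)
print(logits.shape, cu_seqlens.tolist())
```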
Maintenance & Community
The project is actively maintained by Microsoft, with notable contributions from Liliang Ren, Zichong Li, and Yelong Shen.
Licensing & Compatibility
MIT License. Permissive for commercial use and integration with closed-source projects.
Limitations & Caveats
Fp8 support is experimental and may be unstable or limited in availability. The varlen_mamba dependency for Mamba models must be installed from a specific branch, which adds integration complexity. Reasoning evaluation relies on external libraries and vLLM, which may introduce additional setup steps.