ArchScale by microsoft

Toolkit for neural architecture research and scaling

Created 2 months ago
293 stars

Top 90.1% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

ArchScale is a comprehensive toolkit for neural language model research, focusing on architecture exploration and scaling laws. It offers implementations of various state-of-the-art architectures, scaling techniques, and training optimizations, targeting researchers and engineers seeking to efficiently train and evaluate large language models.

How It Works

ArchScale employs a unified codebase built on PyTorch Lightning Fabric for distributed training with FSDP, mixed precision, and tensor parallelism. Its core innovation lies in its "WYSIWYG" (What You See Is What You Get) philosophy for experiments, enabling easy modification and addition of architectures, scaling laws, optimizers, and initialization strategies. The framework supports advanced techniques like μP++ for efficient hyperparameter sweeping and scaling, variable-length training for long contexts, and fused kernels for optimized performance.
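
As a rough illustration of the training stack described above, the following sketch shows how a Lightning Fabric run with FSDP and bf16 mixed precision is typically wired up. It is a generic Fabric example, not ArchScale's actual entry point; the model, optimizer, and device count are placeholder assumptions.

```python
# Generic Lightning Fabric + FSDP sketch (illustrative; not ArchScale's training script).
import torch
from lightning.fabric import Fabric

fabric = Fabric(
    accelerator="cuda",
    devices=8,               # assumption: 8 GPUs on one node
    strategy="fsdp",         # fully sharded data parallelism
    precision="bf16-mixed",  # mixed-precision training
)
fabric.launch()

model = torch.nn.Transformer(d_model=512, nhead=8)        # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
model, optimizer = fabric.setup(model, optimizer)

# One illustrative step on random tensors; real runs iterate over a tokenized dataset.
src = torch.randn(32, 16, 512, device=fabric.device)
tgt = torch.randn(32, 16, 512, device=fabric.device)
loss = model(src, tgt).float().pow(2).mean()
fabric.backward(loss)        # Fabric routes this through the FSDP/precision plugins
optimizer.step()
optimizer.zero_grad()
```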

Quick Start & Requirements

  • Install: Refer to the provided Dockerfile for environment setup.
  • Prerequisites: Python, PyTorch, Lightning Fabric, and CUDA >= 12 for GPU acceleration. Specific models may require additional dependencies (e.g., varlen_mamba for Mamba-based models, vLLM for reasoning evaluation). A minimal environment sanity check is sketched after this list.
  • Data: Requires tokenized datasets like SlimPajama or custom data for Phi-4-mini-flash.
  • Resources: Large-scale pre-training examples suggest significant GPU resources (e.g., 1K GPUs for Phi-4-mini-flash).
  • Links: Samba codebase for data tokenization.
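
Before launching anything large, a quick environment check helps confirm the PyTorch/CUDA side of the prerequisites above. The snippet below is a hypothetical helper, not part of the repository:

```python
# Hypothetical environment sanity check; not part of ArchScale.
import torch

print(f"PyTorch {torch.__version__}")
assert torch.cuda.is_available(), "A CUDA GPU is required for the training recipes"

cuda_version = torch.version.cuda or "0"
print(f"CUDA {cuda_version}, device: {torch.cuda.get_device_name(0)}")
assert int(cuda_version.split(".")[0]) >= 12, "CUDA >= 12 is recommended for fused kernels"
```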

Highlighted Details

  • Supports diverse architectures including Transformers, SSMs, attention variants, and hybrid models.
  • Implements scaling laws like μP++, Chinchilla FLOPs scaling, and experimental batch size/weight decay scaling (see the sketch after this list).
  • Offers advanced training features: torch.compile, FSDP, tensor parallelism, experimental fp8 support, packed datasets, and variable-length training.
  • Provides evaluation for standard NLP benchmarks, long-context tasks (RULER, Phonebook), and reasoning capabilities.
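
To make the scaling-law bullet concrete: under standard μP with Adam, hyperparameters tuned at a small base width transfer to wider models by shrinking hidden-layer learning rates like base_width / width and initialization std like sqrt(base_width / width). The sketch below illustrates that generic rule; it is not ArchScale's μP++ recipe, and all names and base values are placeholders.

```python
# Generic muP-style width scaling sketch (illustrative; not ArchScale's muP++ recipe).
import math

def mup_scaled_hparams(width: int, base_width: int = 256,
                       base_lr: float = 3e-3, base_std: float = 0.02) -> dict:
    """Scale Adam learning rate and init std for hidden layers as width grows."""
    ratio = base_width / width
    return {
        "hidden_lr": base_lr * ratio,             # LR shrinks linearly with width
        "init_std": base_std * math.sqrt(ratio),  # init std shrinks with sqrt(width)
    }

for width in (256, 512, 1024, 2048):
    print(width, mup_scaled_hparams(width))
```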

Maintenance & Community

The project is actively maintained by Microsoft. Notable contributors include Liliang Ren, Zichong Li, and Yelong Shen.

Licensing & Compatibility

MIT License. Permissive for commercial use and integration with closed-source projects.

Limitations & Caveats

fp8 support is experimental and may be unstable or incomplete. Mamba-based models depend on a specific branch of varlen_mamba, which can complicate installation. Reasoning evaluation relies on external libraries and vLLM, which adds further setup steps.

Health Check

  • Last Commit: 3 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 7 stars in the last 30 days

Explore Similar Projects

Starred by Omar Sanseviero (DevRel at Google DeepMind) and Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake).

llm_training_handbook by huggingface

0%
511 stars
Handbook for large language model training methodologies
Created 2 years ago
Updated 1 year ago
Starred by Tobi Lutke (Cofounder of Shopify), Li Jiang (Coauthor of AutoGen; Engineer at Microsoft), and 26 more.

ColossalAI by hpcaitech

0.1%
41k stars
AI system for large-scale parallel training
Created 3 years ago
Updated 17 hours ago