ArchScale by microsoft

Toolkit for neural architecture research and scaling

created 1 month ago · 277 stars
Top 94.5% on sourcepulse

Project Summary

ArchScale is a comprehensive toolkit for neural language model research, focusing on architecture exploration and scaling laws. It offers implementations of various state-of-the-art architectures, scaling techniques, and training optimizations, targeting researchers and engineers seeking to efficiently train and evaluate large language models.

How It Works

ArchScale employs a unified codebase built on PyTorch Lightning Fabric for distributed training with FSDP, mixed precision, and tensor parallelism. Its core innovation is a "WYSIWYG" (What You See Is What You Get) philosophy for experiments, making it easy to modify and add architectures, scaling laws, optimizers, and initialization strategies. The framework supports advanced techniques such as μP++ for efficient hyperparameter sweeping and scaling, variable-length training for long contexts, and fused kernels for optimized performance.
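ArchScale's actual entry point is not shown here; the sketch below is only a minimal illustration of the generic Lightning Fabric pattern the summary describes (FSDP plus bf16 mixed precision). The model, batch, and hyperparameters are placeholders, not ArchScale code.

```python
# Minimal sketch of the Fabric pattern described above (FSDP + mixed
# precision). All names are placeholders, not ArchScale's training script.
import torch
import torch.nn as nn
from lightning.fabric import Fabric

fabric = Fabric(
    accelerator="cuda",
    devices=8,               # assume one node with 8 GPUs
    strategy="fsdp",         # fully sharded data parallelism
    precision="bf16-mixed",  # mixed-precision training
)
fabric.launch()

# With FSDP, shard the model first, then create and set up the optimizer.
model = fabric.setup_module(nn.TransformerEncoderLayer(d_model=1024, nhead=16, batch_first=True))
optimizer = fabric.setup_optimizers(torch.optim.AdamW(model.parameters(), lr=3e-4))

for step in range(100):
    batch = torch.randn(8, 512, 1024, device=fabric.device)  # dummy data
    loss = model(batch).pow(2).mean()                         # dummy loss
    optimizer.zero_grad()
    fabric.backward(loss)  # Fabric routes grads through the FSDP wrapper
    optimizer.step()
```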

Quick Start & Requirements

  • Install: Refer to the provided Dockerfile for environment setup.
  • Prerequisites: Python, PyTorch, and Lightning Fabric; CUDA >= 12 is likely required for GPU acceleration (see the sanity-check sketch after this list). Specific models may require additional dependencies (e.g., varlen_mamba for Mamba-based models, vLLM for reasoning evaluation).
  • Data: Requires tokenized datasets like SlimPajama or custom data for Phi-4-mini-flash.
  • Resources: Large-scale pre-training examples suggest significant GPU resources (e.g., 1K GPUs for Phi-4-mini-flash).
  • Links: Samba codebase for data tokenization.
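Before launching training, a quick environment check along these lines (plain PyTorch, not an ArchScale utility) can confirm the CUDA toolchain matches the prerequisites:

```python
# Environment sanity check (plain PyTorch, not an ArchScale utility).
import torch

print("PyTorch:", torch.__version__)
print("CUDA build:", torch.version.cuda)  # expect >= 12 per the prerequisites
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    print("bf16 supported:", torch.cuda.is_bf16_supported())
```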

Highlighted Details

  • Supports diverse architectures including Transformers, SSMs, attention variants, and hybrid models.
  • Implements scaling laws including μP++, Chinchilla-style FLOPs scaling (a worked example follows this list), and experimental batch size/weight decay scaling.
  • Offers advanced training features: torch.compile, FSDP, tensor parallelism, experimental fp8 support, packed datasets, and variable-length training.
  • Provides evaluation for standard NLP benchmarks, long-context tasks (RULER, Phonebook), and reasoning capabilities.
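To make the Chinchilla-style scaling concrete (these are rules of thumb from the Chinchilla paper, not ArchScale code): training FLOPs are roughly 6·N·D for N parameters and D tokens, with a compute-optimal budget of about 20 tokens per parameter.

```python
# Back-of-the-envelope Chinchilla scaling: FLOPs ~ 6 * params * tokens,
# compute-optimal tokens ~ 20 * params. Rules of thumb, not ArchScale code.
def chinchilla_budget(n_params: float, tokens_per_param: float = 20.0):
    tokens = tokens_per_param * n_params
    flops = 6.0 * n_params * tokens
    return tokens, flops

for n in (0.4e9, 1.3e9, 7e9):  # example model sizes
    tokens, flops = chinchilla_budget(n)
    print(f"{n / 1e9:.1f}B params -> {tokens / 1e9:.0f}B tokens, {flops:.2e} FLOPs")
```

For example, a 1.3B-parameter model lands at roughly 26B tokens and about 2e20 training FLOPs under these rules.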

Maintenance & Community

The project is actively maintained by Microsoft. Notable contributors include Liliang Ren, Zichong Li, and Yelong Shen.

Licensing & Compatibility

MIT License. Permissive for commercial use and integration with closed-source projects.

Limitations & Caveats

Experimental fp8 support may be unstable or have limited availability. The varlen_mamba dependency for Mamba-based models must be installed from a specific branch, which can complicate environment setup. Reasoning evaluation relies on external libraries and vLLM, which adds extra setup steps.
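For the vLLM dependency, the extra setup amounts to installing vLLM and pointing it at a checkpoint; a minimal generation call looks like the sketch below (vLLM's public API, with a placeholder checkpoint id and prompt, not ArchScale's eval harness).

```python
# Minimal vLLM generation sketch. The checkpoint id and prompt are
# placeholders, not ArchScale's actual reasoning-eval harness.
from vllm import LLM, SamplingParams

llm = LLM(model="microsoft/Phi-4-mini-flash-reasoning")  # example checkpoint
params = SamplingParams(temperature=0.6, max_tokens=512)
outputs = llm.generate(["Solve step by step: 12 * 37 = ?"], params)
print(outputs[0].outputs[0].text)
```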

Health Check

  • Last commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 277 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems) and Jiayi Pan (Author of SWE-Gym; AI Researcher at UC Berkeley).

DeepSeek-Coder-V2 by deepseek-ai
0.4% · 6k stars
Open-source code language model comparable to GPT4-Turbo
created 1 year ago · updated 10 months ago
Starred by Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), Michael Han (Cofounder of Unsloth), and 1 more.

ktransformers by kvcache-ai
0.4% · 15k stars
Framework for LLM inference optimization experimentation
created 1 year ago · updated 2 days ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Jeff Hammerbacher (Cofounder of Cloudera), and 10 more.

open-r1 by huggingface
0.2% · 25k stars
SDK for reproducing DeepSeek-R1
created 6 months ago · updated 3 days ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Nat Friedman (Former CEO of GitHub), and 32 more.

llama.cpp by ggml-org
0.4% · 84k stars
C/C++ library for local LLM inference
created 2 years ago · updated 14 hours ago