nanowhale  by huggingface

Miniature DeepSeek-V4 LLM training and inference

Created 3 weeks ago

360 stars

Top 77.7% on SourcePulse

Summary

Nanowhale is a ~110M-parameter language model built from scratch on the DeepSeek-V4 architecture. It implements advanced LLM techniques such as Mixture-of-Experts (MoE) and Hyper-Connections at miniature scale, aimed at researchers and developers exploring small models. The project provides the code, configurations, and tokenizer needed for pretraining and fine-tuning.

How It Works

The model implements DeepSeek-V4 features at small scale: Multi-Head Latent Attention (MLA) with MQA and RoPE/NoPE, and Mixture-of-Experts (MoE) layers with top-2 routing and SwiGLU FFNs. It also includes Hyper-Connections with Sinkhorn routing and Multi-Token Prediction (MTP). The design aims for strong performance at a compact parameter count.
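To make the top-2 routing step concrete, here is a minimal pure-Python sketch for a single token. It is not taken from the nanowhale code: the function names are hypothetical, and the choice of a softmax gate with renormalized top-2 weights is an assumption, since the summary does not say how the router scores or normalizes experts.

```python
import math


def top2_route(router_logits):
    """Pick two experts for one token and return [(expert_index, weight), ...]."""
    # Softmax over all experts (subtract the max for numerical stability).
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the two experts with the highest gate probability.
    top2 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    # Assumption: renormalize so the two kept weights sum to 1.
    kept = sum(probs[i] for i in top2)
    return [(i, probs[i] / kept) for i in top2]


def moe_forward(x, router_logits, experts):
    """Run only the two selected experts on x and mix their outputs."""
    routes = top2_route(router_logits)
    outputs = [(w, experts[i](x)) for i, w in routes]
    return [sum(w * out[d] for w, out in outputs) for d in range(len(x))]


# Toy experts: expert k simply scales its input by k (illustrative only).
toy_experts = [lambda v, k=k: [k * a for a in v] for k in range(4)]
mixed = moe_forward([1.0, 2.0], [0.0, 2.0, 1.0, -1.0], toy_experts)
```

Only the selected experts are evaluated, which is where the compute saving of sparse MoE comes from: the parameter count grows with the number of experts while per-token work stays close to that of two FFNs.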

Quick Start & Requirements

Install dependencies with pip install -r requirements.txt. The main entry points are:

  • Pretraining: python scripts/train_pretrain.py --config configs/main_100m.yaml
  • SFT fine-tuning: python scripts/train_sft.py
  • Interactive chat: python scripts/chat.py
  • Evaluation: scripts/eval_smoke.py

Training was performed on an NVIDIA H100 80GB GPU using bf16.

Highlighted Details

  • Achieves a perplexity of 12.90 on held-out English text after SFT.
  • Pretraining ran at 72 ms per step on an H100 using torch.compile.
  • The model incorporates advanced features like Mixture-of-Experts (MoE) and Hyper-Connections within a ~110M parameter footprint.
  • Provides both a base pretrained model and an SFT chat model.
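For readers unfamiliar with the metric, perplexity is the exponential of the mean per-token negative log-likelihood, so the reported 12.90 corresponds to an average loss of roughly 2.56 nats per token. The sketch below shows that relationship; the function name is hypothetical and this is not the project's evaluation script.

```python
import math


def perplexity(token_nlls):
    """Perplexity from per-token negative log-likelihoods (natural log)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


# Mean loss implied by the reported perplexity of 12.90.
implied_loss = math.log(12.90)

# Feeding that loss back in recovers the reported perplexity.
recovered = perplexity([implied_loss] * 4)
```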

Maintenance & Community

No specific details regarding maintainers, community channels (like Discord/Slack), or sponsorships are present in the provided README.

Licensing & Compatibility

The project is released under the MIT license, permitting broad commercial use and integration without copyleft restrictions.

Limitations & Caveats

The model exhibits NaN issues in bf16 precision due to Hyper-Connections; fp32 is recommended for training and inference. The from_pretrained method has a quirk requiring manual load_state_dict. The large 129K vocabulary size consumes a significant portion (37%) of the total parameters, potentially limiting the model's capacity for complex language modeling tasks at this scale.
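The 37% figure can be sanity-checked with a quick parameter count. The sketch below assumes a single embedding matrix of shape vocabulary × hidden size (tied input and output embeddings) and a hidden size of 320. Neither is stated in this summary: the value 320 is chosen only because it roughly reproduces the reported share, and the function name is hypothetical.

```python
def embedding_share(vocab_size, hidden_size, total_params, tied=True):
    """Fraction of total parameters spent on token embeddings."""
    # Tied embeddings reuse one matrix for input and output; untied uses two.
    matrices = 1 if tied else 2
    return vocab_size * hidden_size * matrices / total_params


# Assumed values: 129K vocabulary, hidden size 320, ~110M total parameters.
share = embedding_share(129_000, 320, 110_000_000)
print(f"embedding share: {share:.1%}")
```

Under these assumptions the embeddings take about 41M parameters, leaving under 70M for attention, experts, and everything else.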

Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: Inactive
  • Pull requests (30d): 1
  • Issues (30d): 0
  • Star history: 361 stars in the last 23 days
