nanowhale by huggingface

Miniature DeepSeek-V4 LLM training and inference

Created 2 months ago

380 stars

Top 74.6% on SourcePulse

1 Expert Loves This Project

Lewis Tunstall

Research Engineer at Hugging Face

Summary

Nanowhale is a ~110M-parameter language model built from scratch on the DeepSeek-V4 architecture. It is a miniature implementation of techniques such as Mixture-of-Experts (MoE) and Hyper-Connections, aimed at researchers and developers who want to study these designs at small scale. The project provides the code, configurations, and tokenizer needed for pretraining and fine-tuning.

How It Works

The model implements DeepSeek-V4 features at a small scale: Multi-Head Latent Attention (MLA) with Multi-Query Attention (MQA) and RoPE/NoPE, and Mixture-of-Experts (MoE) with top-2 routing and SwiGLU FFNs. Hyper-Connections with Sinkhorn routing and Multi-Token Prediction (MTP) are also included. The goal is to keep these architectural components intact while fitting them into a compact parameter budget.
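To make the top-2 routing idea concrete, here is a minimal pure-Python sketch of how a router can pick two experts per token and mix their outputs. It is not taken from the nanowhale code: the function names are hypothetical, the experts are toy scalar functions rather than SwiGLU FFNs, and renormalizing the two selected weights so they sum to 1 is an assumption about the gating scheme.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top2_route(router_logits):
    """Return [(expert_index, weight), (expert_index, weight)] for one token.

    Assumption: the two selected probabilities are renormalized to sum to 1.
    """
    probs = softmax(router_logits)
    top2 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    norm = sum(probs[i] for i in top2)
    return [(i, probs[i] / norm) for i in top2]

def moe_forward(x, router_logits, experts):
    """Weighted sum of the two selected experts' outputs for input x."""
    return sum(w * experts[i](x) for i, w in top2_route(router_logits))

# Toy experts (hypothetical): each just scales its input.
experts = [lambda x: 1.0 * x, lambda x: 2.0 * x, lambda x: 3.0 * x, lambda x: 4.0 * x]
routes = top2_route([2.0, 0.5, 1.0, -1.0])
y = moe_forward(1.0, [2.0, 0.5, 1.0, -1.0], experts)
```

Only the two selected experts are evaluated for each token, which is what keeps the per-token compute well below the total parameter count.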

Quick Start & Requirements

Install dependencies with pip install -r requirements.txt. Start pretraining with python scripts/train_pretrain.py --config configs/main_100m.yaml, run SFT fine-tuning with python scripts/train_sft.py, and open an interactive chat with python scripts/chat.py. Evaluation is available via scripts/eval_smoke.py. Training was performed on an NVIDIA H100 80GB GPU using bf16; note that Limitations & Caveats reports NaN issues in bf16 and recommends fp32.

Highlighted Details

Achieves a perplexity of 12.90 on held-out English text after SFT.
Pretraining step time reached 72 ms/step on an H100 using torch.compile.
The model incorporates advanced features like Mixture-of-Experts (MoE) and Hyper-Connections within a ~110M parameter footprint.
Provides both a base pretrained model and an SFT chat model.
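For readers interpreting the perplexity figure above: perplexity is conventionally the exponential of the mean per-token negative log-likelihood. The sketch below shows that relationship; the function name is hypothetical, and whether the project's evaluation script averages over tokens in exactly this way is an assumption.

```python
import math

def perplexity(token_nlls):
    """Perplexity from per-token negative log-likelihoods (natural log)."""
    if not token_nlls:
        raise ValueError("need at least one token loss")
    return math.exp(sum(token_nlls) / len(token_nlls))

# A perplexity of 12.90 corresponds to a mean loss of ln(12.90),
# roughly 2.56 nats per token.
mean_nll = math.log(12.90)
ppl = perplexity([mean_nll] * 4)
```

Read the other way, a reported cross-entropy loss can be turned into perplexity with a single math.exp call, provided the loss is in nats and averaged per token.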

Maintenance & Community

The README gives no details about maintainers, community channels (such as Discord or Slack), or sponsorships.

Licensing & Compatibility

The project is released under the MIT license, permitting broad commercial use and integration without copyleft restrictions.

Limitations & Caveats

The model produces NaNs in bf16 precision, which is attributed to Hyper-Connections; fp32 is recommended for training and inference. The from_pretrained method has a quirk that requires loading the weights manually with load_state_dict. The 129K-token vocabulary accounts for a significant portion (37%) of the total parameters, which leaves less capacity for the rest of the network and may limit the model on complex language modeling tasks at this scale.
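The vocabulary figure can be sanity-checked with simple arithmetic. The sketch below uses only the rounded numbers quoted above (129K vocabulary, ~110M parameters, 37%) and assumes the share comes from a single embedding table (for example, tied input and output embeddings). That tying, the helper names, and the width of 320 used in the forward check are assumptions for illustration, not values confirmed by the project.

```python
def embedding_share(vocab_size, hidden_dim, total_params, num_tables=1):
    """Fraction of total parameters held by the embedding table(s)."""
    return vocab_size * hidden_dim * num_tables / total_params

def implied_width(vocab_size, total_params, share, num_tables=1):
    """Embedding width implied by a given parameter share."""
    return share * total_params / (vocab_size * num_tables)

VOCAB = 129_000        # "129K", rounded
TOTAL = 110_000_000    # "~110M", rounded

# Roughly 315 dimensions per token if one table holds 37% of the parameters.
width = implied_width(VOCAB, TOTAL, 0.37)

# Forward check with a hypothetical width of 320: about 0.375 of the total.
share = embedding_share(VOCAB, 320, TOTAL)
```

Because the embedding cost scales as vocabulary size times width, a large vocabulary dominates the parameter budget whenever the rest of the model is small.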

Health Check

Last Commit

2 months ago

Responsiveness

Inactive

Star History

9 stars in the last 30 days