huggingface
Miniature DeepSeek-V4 LLM training and inference
Top 77.7% on SourcePulse
Summary
Nanowhale is a ~110M-parameter language model built from scratch on the DeepSeek-V4 architecture. It is a miniature implementation of advanced LLM techniques such as MoE and Hyper-Connections, aimed at researchers and developers exploring small-scale models. The project provides the code, configurations, and tokenizer needed for pretraining and fine-tuning.
How It Works
The model implements DeepSeek-V4 features at small scale: Multi-Head Latent Attention (MLA) with MQA and RoPE/NoPE, and Mixture-of-Experts (MoE) with top-2 routing and SwiGLU FFNs. It also includes Hyper-Connections with Sinkhorn routing and Multi-Token Prediction (MTP). The design aims for strong performance at a compact parameter count by relying on these architectural techniques rather than on scale.
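To make two of the terms above concrete, here is a minimal pure-Python sketch of top-2 expert routing and of generic Sinkhorn normalization. It is not the project's code. The function names (`top2_route`, `moe_forward`, `sinkhorn`) are hypothetical, and softmax gating, renormalized top-2 weights, and scalar "experts" are assumptions made for illustration. How the repository applies Sinkhorn inside Hyper-Connections is not described here.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top2_route(router_logits):
    """Pick the two highest-probability experts and renormalize their weights.

    Assumption: softmax gating with weights rescaled to sum to 1.
    Returns a list of (expert_index, weight) pairs, best expert first.
    """
    probs = softmax(router_logits)
    top2 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    norm = sum(probs[i] for i in top2)
    return [(i, probs[i] / norm) for i in top2]

def moe_forward(x, router_logits, experts):
    """Weighted sum of the two selected experts' outputs (scalar toy case)."""
    return sum(w * experts[i](x) for i, w in top2_route(router_logits))

def sinkhorn(matrix, iters=20):
    """Alternately normalize rows, then columns, of a positive square matrix.

    With enough iterations the result is approximately doubly stochastic
    (every row and every column sums to 1).
    """
    m = [row[:] for row in matrix]
    for _ in range(iters):
        m = [[v / sum(row) for v in row] for row in m]
        col_sums = [sum(row[j] for row in m) for j in range(len(m[0]))]
        m = [[v / col_sums[j] for j, v in enumerate(row)] for row in m]
    return m

if __name__ == "__main__":
    # Four hypothetical experts that simply scale their input.
    experts = [lambda x: x, lambda x: 2 * x, lambda x: 3 * x, lambda x: 4 * x]
    logits = [0.1, 2.0, -1.0, 1.0]
    print(top2_route(logits))
    print(moe_forward(1.0, logits, experts))
    print(sinkhorn([[1.0, 2.0], [3.0, 4.0]]))
```

Only two experts run per token, so compute per token stays roughly constant as the number of experts grows. That is the usual motivation for sparse routing in a small model.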
Quick Start & Requirements
Install dependencies with pip install -r requirements.txt. Start pretraining with python scripts/train_pretrain.py --config configs/main_100m.yaml, run SFT fine-tuning with python scripts/train_sft.py, and open an interactive chat with python scripts/chat.py. Evaluation is available via scripts/eval_smoke.py. Training was performed on an NVIDIA H100 80GB GPU using bf16.
Highlighted Details
torch.compile.
Maintenance & Community
No specific details regarding maintainers, community channels (like Discord/Slack), or sponsorships are present in the provided README.
Licensing & Compatibility
The project is released under the MIT license, permitting broad commercial use and integration without copyleft restrictions.
Limitations & Caveats
The model produces NaNs in bf16 precision because of Hyper-Connections, so fp32 is recommended for training and inference. The from_pretrained method has a quirk that requires a manual load_state_dict call. The large 129K-token vocabulary accounts for a significant portion (37%) of the total parameters, which may limit the capacity left for complex language modeling at this scale.
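The vocabulary figure can be sanity-checked with back-of-the-envelope arithmetic: an embedding table holds vocab_size × d_model parameters. The sketch below is illustrative only. The helper name `embedding_share` is hypothetical, the hidden width of 320 is an assumed value chosen for the example (the model's actual width is not stated here), and a single tied embedding matrix is assumed.

```python
def embedding_share(vocab_size, d_model, total_params, tied=True):
    """Fraction of total parameters held by the token embedding table(s).

    tied=True counts one matrix shared by input and output layers;
    tied=False counts separate input and output matrices.
    """
    matrices = 1 if tied else 2
    return matrices * vocab_size * d_model / total_params

# Assumed values: ~129K vocabulary, ~110M total parameters (from the text),
# and a HYPOTHETICAL hidden width of 320 (not confirmed by the source).
share = embedding_share(vocab_size=129_000, d_model=320, total_params=110_000_000)
print(f"embedding share: {share:.1%}")
```

Under these assumptions the embedding table alone lands in the high-30s percent of the budget, in line with the 37% quoted above. A smaller vocabulary would free those parameters for the transformer layers.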