nanocode by salmanmohammadi

Claude Code model training library

Created 5 months ago
258 stars

Top 98.0% on SourcePulse

Summary

nanocode is a JAX library for end-to-end training of custom Claude Code-style large language models with Constitutional AI. It targets researchers and developers who want to build capable, cost-effective code-generation models, and covers the full pipeline from tokenizer training through agentic SFT to DPO alignment, optimized for Google TPUs.

How It Works

The library is written in pure JAX and leverages TPU acceleration for efficient training. Its core approach integrates Constitutional AI principles throughout the model development lifecycle: custom tokenizer training, large-scale pretraining on diverse datasets, synthetic data generation pipelines for specialized tasks, agentic supervised fine-tuning with tool use, and Direct Preference Optimization (DPO) for constitutional alignment, which gives fine-grained control over model behavior.
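
To make the DPO stage concrete, here is a minimal sketch of the standard DPO objective in JAX. The function name and inputs are hypothetical, not nanocode's API; it assumes you already have summed per-sequence log-probabilities for chosen and rejected completions from both the policy being trained and a frozen reference model.

    # A minimal DPO loss sketch in JAX (illustrative; not nanocode's actual code).
    import jax
    import jax.numpy as jnp

    def dpo_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps, beta=0.1):
        # Implicit rewards are beta-scaled log-ratios of policy to reference.
        chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
        rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
        # DPO is a logistic loss on the reward margin: increase the policy's
        # preference for the chosen completion relative to the frozen reference.
        margin = chosen_rewards - rejected_rewards
        return -jnp.mean(jax.nn.log_sigmoid(margin))

The beta parameter trades off preference strength against staying close to the reference model; 0.1 is a common default in the DPO literature, not a value taken from nanocode.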

Quick Start & Requirements

  • Primary Install/Run: Provision a Google Cloud TPU (e.g., v6e-8), configure gcloud, run ./install.sh tpu on the TPU pod, then launch training with the speedrun_*.sh scripts. (A quick device sanity check is sketched after this list.)
  • Prerequisites: A Google Cloud account with TPU access (TRC program or paid), the gcloud CLI, and tmux. Synthetic data generation additionally needs an OpenRouter API key or a local vLLM server. NVIDIA GPUs require --attn-impl=eager.
  • Resource Footprint: Training a 1.3B-parameter model (d24) costs roughly $200 and takes ~9 hours on a TPU v6e-8; a 477M-parameter model (d20) costs about $34 and takes ~1.5 hours.
  • Links: An announcement post is referenced but not linked; pre-packaged HuggingFace datasets are available (e.g., smohammadi/nanocode-tulu-selfoss-evol).
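
Before kicking off a speedrun, it can help to confirm that JAX actually sees the TPU cores. This check uses only the public JAX API and is not part of nanocode:

    # Sanity check after ./install.sh tpu: confirm JAX sees the TPU.
    import jax

    print(jax.default_backend())   # expect "tpu" on a TPU VM
    devices = jax.devices()
    print(len(devices))            # expect 8 on a v6e-8
    for d in devices:
        print(d)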

Highlighted Details

  • Full Constitutional AI training suite: tokenizer, pretraining, synthetic data generation, agentic SFT, DPO.
  • Cost-effective model training: 1.3B params for $200, 477M params for $34.
  • Optimized for TPUs with pure JAX implementation.
  • Includes tools for generating synthetic data and provides pre-packaged datasets on HuggingFace.

Maintenance & Community

The project is authored by Salman Mohammadi. The README does not mention community channels (e.g., Discord, Slack), sponsorships, or notable contributors.

Licensing & Compatibility

The project is released under the MIT license, which is permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

The library is designed for and primarily optimized on Google TPUs. NVIDIA GPUs are supported but require the --attn-impl=eager flag, and multi-GPU configurations have not been extensively tested. Synthetic data generation requires an OpenRouter API key or a local vLLM server.
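
The --attn-impl=eager flag comes from the README; nanocode's internals are not shown here, but "eager" attention conventionally means a plain, unfused scaled dot-product implementation rather than a fused kernel such as flash attention. A generic JAX sketch of that fallback, with hypothetical names and shapes:

    # Generic "eager" (unfused) scaled dot-product attention; illustrative only.
    import jax.numpy as jnp
    from jax import nn

    def eager_attention(q, k, v, mask=None):
        # q, k, v: [batch, heads, seq_len, head_dim]
        scale = 1.0 / jnp.sqrt(q.shape[-1])
        scores = jnp.einsum("bhqd,bhkd->bhqk", q, k) * scale
        if mask is not None:
            # mask broadcastable to [batch, heads, q_len, k_len]; True = keep.
            scores = jnp.where(mask, scores, jnp.finfo(scores.dtype).min)
        weights = nn.softmax(scores, axis=-1)
        return jnp.einsum("bhqk,bhkd->bhqd", weights, v)

Fused kernels compute the same result faster and with less memory, but can be unavailable or untested on some GPU stacks, which is presumably why the eager path is required there.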

Health Check

  • Last Commit: 3 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 0
  • Star History: 260 stars in the last 30 days
