create-llm by theaniketgiri

Scaffolding LLM training projects

Created 5 months ago
305 stars

Top 87.9% on SourcePulse

Project Summary

Summary

This project addresses the complexity of building and training custom Large Language Models (LLMs) by providing a CLI tool that rapidly scaffolds production-ready PyTorch training projects. It targets engineers and researchers who want an accelerated path to LLM development, offering a streamlined, create-next-app-style experience for custom model creation.

How It Works

The tool scaffolds projects using PyTorch, offering four right-sized templates (NANO, TINY, SMALL, BASE) optimized for use cases ranging from learning exercises to research-grade models. It bundles a complete toolkit: data preprocessing pipelines, multiple tokenizer training options (BPE, WordPiece, Unigram), robust training loops with checkpoint management, evaluation metrics, text generation utilities, and deployment scripts. Smart defaults configure training parameters automatically, while an optional plugin system integrates with tools like WandB and HuggingFace.
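As a rough illustration of what a checkpointed training loop like the one the templates bundle can look like, here is a minimal PyTorch sketch that saves model and optimizer state at a fixed step interval. The function names, checkpoint directory layout, and the assumption that the model returns a scalar loss are illustrative, not the tool's actual generated code.

  import os
  import torch

  def save_checkpoint(model, optimizer, step, ckpt_dir="checkpoints"):
      # Persist model and optimizer state so an interrupted run can resume.
      os.makedirs(ckpt_dir, exist_ok=True)
      torch.save(
          {"model": model.state_dict(),
           "optimizer": optimizer.state_dict(),
           "step": step},
          os.path.join(ckpt_dir, f"step_{step}.pt"),
      )

  def train(model, optimizer, data_loader, device="cpu", save_every=1000):
      model.to(device).train()
      step = 0
      for inputs, targets in data_loader:
          optimizer.zero_grad()
          # Illustrative assumption: the model computes and returns its own loss.
          loss = model(inputs.to(device), targets.to(device))
          loss.backward()
          optimizer.step()
          step += 1
          if step % save_every == 0:
              save_checkpoint(model, optimizer, step)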

Quick Start & Requirements

  • Primary Install/Run: npx @theaniketgiri/create-llm <project-name> (recommended).
  • Prerequisites:
    • CLI: Node.js 18.0.0+, npm 8.0.0+.
    • Training: Python 3.8+, PyTorch 2.0.0+.
    • Hardware: Minimum 4GB RAM (NANO/TINY), 12GB VRAM recommended (SMALL), 40GB+ VRAM for BASE (a quick environment check is sketched after this list).
    • Docker: Docker 20.10+, NVIDIA Docker for GPU support.
  • Links: GitHub, npm.
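Before picking a template, it can help to confirm the training-side prerequisites and available VRAM. The check below is an illustrative snippet, not part of create-llm itself.

  import torch

  print("PyTorch:", torch.__version__)               # prerequisites call for 2.0.0+
  print("CUDA available:", torch.cuda.is_available())
  if torch.cuda.is_available():
      props = torch.cuda.get_device_properties(0)
      print(f"GPU: {props.name}, {props.total_memory / 1e9:.1f} GB VRAM")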

Highlighted Details

  • Template Variety: Four distinct templates (NANO, TINY, SMALL, BASE) cater to specific needs, ranging from 1M to 1B parameters, with corresponding hardware and time estimates (a rough parameter-count sketch follows this list).
  • Comprehensive Feature Set: Out-of-the-box support for data preparation, tokenizer training, checkpointing, TensorBoard, live dashboards, interactive chat, and deployment.
  • Intelligent Defaults & Interactivity: Features smart configuration, auto-detection of parameters, error diagnostics, and interactive prompts for a guided setup experience.
  • Docker First: Strong emphasis on Docker for consistent environments, eliminating local Node.js/Python dependencies and simplifying GPU utilization.
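To give a feel for how the 1M to 1B parameter range maps onto model shape, here is a common rule-of-thumb parameter count for GPT-style decoders. The example configurations are illustrative guesses, not the settings actually shipped in the NANO or BASE templates.

  def approx_gpt_params(n_layer, d_model, vocab_size):
      # Each transformer block contributes roughly 12 * d_model^2 parameters
      # (attention ~4*d^2, MLP ~8*d^2); add the token-embedding matrix.
      return 12 * n_layer * d_model ** 2 + vocab_size * d_model

  # Illustrative configs (not the actual create-llm template settings)
  print(f"NANO-ish: ~{approx_gpt_params(4, 96, 8_000) / 1e6:.1f}M parameters")
  print(f"BASE-ish: ~{approx_gpt_params(16, 2048, 50_000) / 1e6:.0f}M parameters")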

Maintenance & Community

The project is maintained by Aniket Giri. Contributions are welcomed, with specific areas for improvement outlined. Community interaction is primarily through GitHub issues.

Licensing & Compatibility

Released under the MIT License, permitting broad use, modification, and distribution, including for commercial purposes and integration into closed-source projects.

Limitations & Caveats

Training the larger models (SMALL, BASE) requires substantial GPU VRAM (12GB+ and 40GB+, respectively). The effectiveness of the smaller templates depends on sufficient data quantity and quality. While common issues are addressed by the tooling, complex LLM training runs may still hit challenges the scaffold's defaults do not anticipate.

Health Check

  • Last Commit: 2 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 7 stars in the last 30 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Stefan van der Walt (core contributor to the scientific Python ecosystem), and 12 more.

litgpt by Lightning-AI

Top 0.1% on SourcePulse
13k stars
LLM SDK for pretraining, finetuning, and deploying 20+ high-performance LLMs
Created 2 years ago
Updated 3 days ago