guppylm  by arman-bd

Accessible LLM training and inference

Created 1 week ago

New!

2,319 stars

Top 19.2% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

GuppyLM is a ~9M parameter language model designed to demystify the end-to-end process of training an LLM from scratch. It targets engineers and researchers who want to understand LLM mechanics without extensive compute or prior expertise. The project provides a fully runnable pipeline, from data generation to inference, letting users build their own LLM in minutes and see inside what is usually treated as a "black box" model.

How It Works

The project employs a deliberately simple, vanilla transformer architecture with 8.7 million parameters, eschewing advanced optimizations such as grouped-query attention (GQA), rotary position embeddings (RoPE), or SwiGLU activations in favor of clarity and ease of implementation. It uses a BPE tokenizer with a 4,096-token vocabulary and a 128-token maximum sequence length. Training runs on 60,000 synthetic conversations generated via template composition across 60 distinct topics, giving the model a consistent, fish-like personality. This minimalist approach prioritizes educational value and rapid iteration over raw performance.
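As a rough illustration of the scale involved, the parameter count of such a vanilla transformer can be estimated directly from its hyperparameters. The width and depth below are hypothetical values chosen to land near the stated 8.7M; GuppyLM's actual configuration may differ, though the vocabulary and sequence length match the README:

```python
def count_vanilla_transformer_params(
    vocab_size=4096,   # BPE vocabulary size (stated in the README)
    max_seq_len=128,   # maximum sequence length (stated in the README)
    d_model=256,       # hypothetical embedding width
    n_layers=8,        # hypothetical number of transformer blocks
    d_ff=1024,         # hypothetical feed-forward width (4 * d_model)
):
    """Estimate parameters of a vanilla (no GQA/RoPE/SwiGLU) transformer."""
    # Token embeddings plus learned positional embeddings
    embeddings = vocab_size * d_model + max_seq_len * d_model
    # Per block: Q/K/V/output projections (weights + biases)
    attention = 4 * d_model * d_model + 4 * d_model
    # Per block: two-layer MLP with biases
    feed_forward = d_model * d_ff + d_ff + d_ff * d_model + d_model
    # Per block: two LayerNorms (scale + shift each)
    layer_norms = 2 * 2 * d_model
    block = attention + feed_forward + layer_norms
    # Final LayerNorm and an untied output head
    final_norm = 2 * d_model
    lm_head = vocab_size * d_model
    return embeddings + n_layers * block + final_norm + lm_head

print(count_vanilla_transformer_params())  # ~8.4M with these assumed settings
```

With these assumed settings the estimate comes out around 8.4M, the same ballpark as the project's reported 8.7M; small changes to width, depth, or weight tying account for the difference.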

Quick Start & Requirements

  • Browser Demo: No install needed; runs locally via WebAssembly with a quantized ONNX model (~10 MB). Link
  • Colab: Pre-trained model chat and full training pipeline available. Notebook Generator
  • Local CLI: pip install torch tokenizers, then python -m guppylm chat.
  • Training: Requires a T4 GPU (available in Colab) and approximately 5 minutes.
  • Dataset: arman-bd/guppylm-60k-generic

Highlighted Details

  • Minimalist 8.7M parameter vanilla transformer architecture.
  • End-to-end training demonstrated in a single Colab notebook.
  • Browser-based inference via WebAssembly and quantized ONNX.
  • Synthetic data generation pipeline for consistent personality.
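The synthetic-data bullet above can be sketched as template composition: pair question and answer templates with a topic list, then fill the slots. The topics and templates here are invented placeholders for illustration, not GuppyLM's actual dataset contents:

```python
import random

# Hypothetical stand-ins for the project's 60 topics and its templates.
TOPICS = ["coral reefs", "tides", "plankton", "ocean currents"]
TEMPLATES = [
    ("Can you tell me about {topic}?",
     "Blub blub! {topic_cap} is one of my favorite things in the sea."),
    ("What do you think of {topic}?",
     "As a little fish, I swim past {topic} all the time!"),
]

def make_conversation(rng):
    """Compose one user/assistant exchange from a random template and topic."""
    topic = rng.choice(TOPICS)
    user_template, bot_template = rng.choice(TEMPLATES)
    return {
        "user": user_template.format(topic=topic),
        "assistant": bot_template.format(
            topic=topic, topic_cap=topic.capitalize()
        ),
    }

rng = random.Random(42)
dataset = [make_conversation(rng) for _ in range(3)]
for conv in dataset:
    print(conv["user"], "->", conv["assistant"])
```

Because every answer template carries the same voice, the fine-tuned model absorbs a uniform personality; scaling this loop across 60 topics and many templates yields the 60k-conversation corpus described above.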

Maintenance & Community

The provided README does not detail specific maintenance contributors, community channels (like Discord/Slack), or a public roadmap.

Licensing & Compatibility

Released under the MIT license, permitting commercial use and integration into closed-source projects without significant restrictions.

Limitations & Caveats

The model's 128-token context window severely limits multi-turn conversation quality, leading to degradation after a few exchanges. Its personality is hardcoded into the weights rather than being controllable via system prompts, and it does not comprehend complex human abstractions.
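The context limit means a multi-turn chat loop must discard the oldest tokens once accumulated history exceeds 128. A minimal sliding-window scheme (the token IDs and helper below are illustrative, not GuppyLM's actual inference code) shows why earlier turns vanish:

```python
def fit_to_context(history_ids, max_seq_len=128):
    """Keep only the most recent tokens that fit in the model's window.

    Everything older is silently discarded, which is why conversation
    quality degrades after a few exchanges: the model literally cannot
    see earlier turns.
    """
    if len(history_ids) <= max_seq_len:
        return history_ids
    return history_ids[-max_seq_len:]

# Simulate a chat whose accumulated history outgrows the window.
history = list(range(300))     # 300 fake token IDs from prior turns
window = fit_to_context(history)
print(len(window), window[0])  # 128 tokens, starting at ID 172
```

A few exchanges at typical message lengths are enough to push the first turns out of the window entirely.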

Health Check
Last Commit

3 days ago

Responsiveness

Inactive

Pull Requests (30d)
5
Issues (30d)
0
Star History
2,331 stars in the last 13 days

Explore Similar Projects

Starred by Maxime Labonne (Head of Post-Training at Liquid AI), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 19 more.

llm-course by mlabonne

Top 0.5% · 78k stars
LLM course with roadmaps and notebooks
Created 2 years ago · Updated 2 months ago