TinyLlama by jzhang38

Tiny pretraining project for a 1.1B Llama model

Created 2 years ago
8,750 stars

Top 5.9% on SourcePulse

Project Summary

The TinyLlama project provides a 1.1 billion parameter Llama-compatible language model pre-trained on 3 trillion tokens. It is aimed at researchers and developers who need a compact yet capable LLM for compute-constrained applications, such as deployment on edge devices or real-time dialogue generation, and at those studying LLM training dynamics.

How It Works

TinyLlama uses the Llama 2 architecture and tokenizer and incorporates optimizations such as Flash Attention 2, fused LayerNorm, SwiGLU, and rotary positional embeddings. These choices yield high training throughput (24k tokens/sec/A100) and a reduced memory footprint, enabling training on consumer hardware and efficient inference. The extensive pre-training on 3 trillion tokens aims to push the limits of smaller models and probe where they saturate.
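
For orientation, the architecture can be expressed as a Hugging Face LlamaConfig. The hyperparameters below are a sketch of TinyLlama's published configuration and should be checked against the config.json shipped with the released checkpoints.

    # Sketch: approximate TinyLlama hyperparameters expressed as a LlamaConfig.
    # Values are illustrative; verify against the checkpoint's config.json.
    from transformers import LlamaConfig, LlamaForCausalLM

    config = LlamaConfig(
        vocab_size=32000,              # Llama 2 tokenizer vocabulary
        hidden_size=2048,
        intermediate_size=5632,        # SwiGLU feed-forward width
        num_hidden_layers=22,
        num_attention_heads=32,
        num_key_value_heads=4,         # grouped-query attention
        max_position_embeddings=2048,
    )
    model = LlamaForCausalLM(config)   # randomly initialized, for inspection only
    print(f"{sum(p.numel() for p in model.parameters()) / 1e9:.2f}B parameters")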

Quick Start & Requirements

  • Install: Models are available on Hugging Face. Usage typically involves libraries such as transformers, llama.cpp, or vLLM (see the loading sketch after this list).
  • Requirements: Python, PyTorch. Specific inference frameworks may have their own dependencies (e.g., CUDA for GPU acceleration).
  • Resources: 4-bit quantized models require ~637MB VRAM. Training requires significant GPU resources (e.g., 16x A100-40G for pre-training).
  • Links: Chat Demo, Discord, EVAL.md, PRETRAIN.md.
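
A minimal loading sketch with transformers is shown below. The model id is assumed to be the chat checkpoint published under the TinyLlama organization on Hugging Face (confirm the exact name there); 4-bit loading via bitsandbytes is optional.

    # Sketch: load and query a TinyLlama chat checkpoint with transformers.
    # The model id is an assumption; confirm it on Hugging Face.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,   # or pass a BitsAndBytesConfig for 4-bit (~637MB VRAM)
        device_map="auto",
    )

    # Format a single-turn chat prompt using the checkpoint's chat template.
    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": "Summarize what TinyLlama is in one sentence."}],
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=128)
    print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))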

Highlighted Details

  • 1.1B parameter model trained on 3T tokens, matching the Llama 2 architecture and tokenizer.
  • Achieves 24k tokens/sec/A100 training throughput with fused optimizations.
  • Offers intermediate checkpoints at various token milestones (105B to 3T).
  • Includes fine-tuned chat models and speculative decoding examples (a rough equivalent is sketched after this list).
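
The repo's speculative decoding examples pair TinyLlama with a larger Llama model. A rough equivalent using transformers' assisted generation is sketched below; the 7B target and the draft checkpoint id are assumptions for illustration, not necessarily the repo's exact setup. Both models must share the Llama 2 tokenizer, which TinyLlama does by design.

    # Sketch: speculative decoding via transformers' assisted generation,
    # with TinyLlama as the small draft model and a larger Llama 2 model as verifier.
    # Both model ids below are assumed; the 7B checkpoint is gated on Hugging Face.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    target_id = "meta-llama/Llama-2-7b-hf"
    draft_id = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"

    tokenizer = AutoTokenizer.from_pretrained(target_id)
    target = AutoModelForCausalLM.from_pretrained(target_id, torch_dtype=torch.float16, device_map="auto")
    draft = AutoModelForCausalLM.from_pretrained(draft_id, torch_dtype=torch.float16, device_map="auto")

    inputs = tokenizer("Speculative decoding speeds up generation by", return_tensors="pt").to(target.device)
    # The draft model proposes several tokens per step; the target model verifies them,
    # so the output matches what the target model alone would produce.
    outputs = target.generate(**inputs, assistant_model=draft, max_new_tokens=64)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))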

Maintenance & Community

  • Active development with community contributions welcomed.
  • Discord server available for discussion.
  • Roadmap includes adding pre-training scripts, sequence length extrapolation, and mobile demos.

Licensing & Compatibility

  • The model weights are released under the Apache 2.0 license.
  • Compatible with Llama 2 ecosystem projects.

Limitations & Caveats

  • The project is under active development, with some planned features yet to be implemented.
  • While extensive, 3T tokens may not represent full saturation for a 1.1B model, as indicated by ongoing research.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 53 stars in the last 30 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Travis Fischer (Founder of Agentic), and 6 more.

picotron by huggingface

Top 4.8% on SourcePulse · 2k stars
Minimalist distributed training framework for educational use
Created 1 year ago · Updated 3 weeks ago
Starred by Jiaming Song (Chief Scientist at Luma AI), Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems"), and 6 more.

LLaMA-Adapter by OpenGVLab

Top 0.1% on SourcePulse · 6k stars
Efficient fine-tuning for instruction-following LLaMA models
Created 2 years ago · Updated 1 year ago
Starred by Roy Frostig (Coauthor of JAX; Research Scientist at Google DeepMind), Zhiqiang Xie (Coauthor of SGLang), and 40 more.

llama by meta-llama

Top 0.1% on SourcePulse · 59k stars
Inference code for Llama 2 models (deprecated)
Created 2 years ago · Updated 7 months ago