TPA by tensorgi

Transformer model implementation for research

Created 8 months ago
385 stars

Top 74.2% on SourcePulse

View on GitHub
Project Summary

T6 is an open-source implementation of the Tensor Product Attention (TPA) Transformer, designed to improve performance and reduce KV cache size for large language models. It targets researchers and developers working on efficient and scalable transformer architectures.

How It Works

T6 uses Tensor Product Attention (TPA), a novel attention mechanism that factorizes queries, keys, and values into contextual low-rank tensor products. Because only the compact K and V factors need to be kept in the KV cache, the cache's memory footprint shrinks substantially while model quality is maintained or improved, enabling more efficient training and inference, particularly for large-scale models. The architecture is built on foundational code from nanoGPT, giving it a robust and familiar starting point.
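
To make the caching idea concrete, below is a minimal, self-contained PyTorch sketch of rank-R factorized keys and values. It is an illustration only, with assumed hyperparameters and layer names that are not the repository's actual code; per the TPA paper, the full method also factorizes queries and is designed to stay compatible with rotary position embeddings.

```python
# Illustrative sketch of tensor-product-factorized K/V caching.
# Hyperparameters and module names are hypothetical, not the repo's API.
import torch
import torch.nn.functional as F


class FactorizedKV(torch.nn.Module):
    """Emits rank-R factors for K and V; only the factors would be cached."""

    def __init__(self, d_model=512, n_heads=8, head_dim=64, rank=2):
        super().__init__()
        self.h, self.dh, self.r = n_heads, head_dim, rank
        # Each projection produces per-token factors instead of full K/V.
        self.k_head = torch.nn.Linear(d_model, rank * n_heads)   # a_K: (R, h)
        self.k_dim = torch.nn.Linear(d_model, rank * head_dim)   # b_K: (R, dh)
        self.v_head = torch.nn.Linear(d_model, rank * n_heads)   # a_V: (R, h)
        self.v_dim = torch.nn.Linear(d_model, rank * head_dim)   # b_V: (R, dh)

    def forward(self, x):                      # x: (B, T, d_model)
        B, T, _ = x.shape
        ak = self.k_head(x).view(B, T, self.r, self.h)
        bk = self.k_dim(x).view(B, T, self.r, self.dh)
        av = self.v_head(x).view(B, T, self.r, self.h)
        bv = self.v_dim(x).view(B, T, self.r, self.dh)
        # A KV cache would store (ak, bk, av, bv): 2*R*(h+dh) numbers per
        # token instead of 2*h*dh for full K and V (here: 288 vs 1024).
        return ak, bk, av, bv

    @staticmethod
    def materialize(a, b):
        # Rebuild full per-head tensors from the rank-R factors:
        # (B, T, h, dh) = average over rank of outer products a_r ⊗ b_r.
        return torch.einsum("btrh,btrd->bthd", a, b) / a.shape[2]


kv = FactorizedKV()
x = torch.randn(2, 16, 512)
ak, bk, av, bv = kv(x)
K = kv.materialize(ak, bk)                     # (2, 16, 8, 64)
V = kv.materialize(av, bv)
q = torch.randn(2, 8, 16, 64)                  # ordinary multi-head queries
out = F.scaled_dot_product_attention(q, K.transpose(1, 2), V.transpose(1, 2))
print(out.shape)                               # torch.Size([2, 8, 16, 64])
```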

Quick Start & Requirements

  • Install: Clone the repository and install dependencies using pip install torch==2.4.0 numpy transformers datasets tiktoken wandb tqdm.
  • Prerequisites: Python 3.10+, PyTorch 2.4.0.
  • Hardware: A100/H100 GPUs with at least 8x80GB VRAM recommended for pretraining.
  • Data: Supports Fineweb-Edu-100B and OpenWebText, with provided data preparation scripts.
  • Docs: Webpage, Huggingface
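
Before launching a long pretraining run, a quick sanity check of the stated prerequisites can save time. The snippet below is a generic illustration, not part of the repository:

```python
# Quick environment check against the stated prerequisites
# (Python 3.10+, PyTorch 2.4.0, multi-GPU with ~80 GB cards).
import sys
import torch

assert sys.version_info >= (3, 10), "Python 3.10+ is required"
print("torch", torch.__version__)              # expected: 2.4.0
print("CUDA available:", torch.cuda.is_available())
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 2**30:.0f} GiB")
```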

Highlighted Details

  • Implements Tensor Product Attention (TPA) for improved performance and reduced KV cache.
  • Scalable training procedures optimized for multi-GPU setups.
  • Flexible data support for datasets like Fineweb-Edu-100B and OpenWebText.
  • Integrated with lm-evaluation-harness for standardized benchmarking.
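
Because the project plugs into lm-evaluation-harness, a benchmark run could look roughly like the sketch below. The checkpoint id is a placeholder, and the repository's own evaluation scripts (or a custom model wrapper rather than the generic `hf` loader) may be required instead:

```python
# Hedged sketch: evaluating a trained checkpoint with lm-evaluation-harness.
# The checkpoint id is a placeholder; the repo's own eval entry point may differ.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                   # generic Hugging Face loader
    model_args="pretrained=<path-or-hub-id-of-T6-checkpoint>,dtype=bfloat16",
    tasks=["arc_easy", "hellaswag", "winogrande"],
    batch_size=8,
)
print(results["results"])
```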

Maintenance & Community

The project is associated with authors from multiple institutions, including those behind the original TPA paper. The README acknowledges nanoGPT, Hugging Face, and EleutherAI.

Licensing & Compatibility

The repository does not state a license in the provided README, so licensing terms should be clarified before commercial use or closed-source linking.

Limitations & Caveats

The README mentions "Higher-order TPA (TBD)" and "Flash TPA (TBD)", indicating these advanced features are under development. The recommended hardware (A100/H100 with 8x80GB VRAM) suggests significant resource requirements for pretraining.

Health Check

  • Last Commit: 3 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 5 stars in the last 30 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Jiayi Pan (Author of SWE-Gym; MTS at xAI), and 34 more.

flash-attention by Dao-AILab

  • 0.6% · 20k stars
  • Fast, memory-efficient attention implementation
  • Created 3 years ago · Updated 1 day ago
  • Starred by Tobi Lutke (Cofounder of Shopify), Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), and 36 more.

unsloth by unslothai

  • 0.6% · 46k stars
  • Finetuning tool for LLMs, targeting speed and memory efficiency
  • Created 1 year ago · Updated 14 hours ago