TPA by tensorgi

Transformer model implementation for research

created 6 months ago
379 stars

Top 76.2% on sourcepulse

View on GitHub
Project Summary

T6 (the Tensor ProducT ATTenTion Transformer) is an open-source implementation of Tensor Product Attention (TPA), designed to improve model performance and reduce KV cache size for large language models. It targets researchers and developers working on efficient, scalable transformer architectures.

How It Works

T6 replaces standard multi-head attention with Tensor Product Attention (TPA), which factorizes each token's queries, keys, and values into compact, contextual low-rank tensor products. Because only the small factor components of the keys and values need to be cached, the KV cache shrinks substantially while model performance improves, enabling more efficient training and inference for large-scale models. The architecture is built on top of nanoGPT, providing a robust and familiar starting point.
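The core idea is shown below as a minimal PyTorch sketch (illustrative only: the module name, ranks, and the dense query projection are assumptions, not the repo's actual code). Each token's per-head keys and values are assembled from a small number of outer products, so a cache only needs the compact factor vectors rather than full head-by-head-dimension matrices.

    # Minimal sketch of the tensor product attention idea (not T6's exact API).
    # Per-token keys/values are sums of R outer products a_r (over heads) x b_r
    # (over head dims), so a KV cache would store only the small factors:
    # O(T * R * (n_head + head_dim)) instead of O(T * n_head * head_dim).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TPASketch(nn.Module):
        def __init__(self, d_model=512, n_head=8, rank=2):
            super().__init__()
            self.h, self.dh, self.r = n_head, d_model // n_head, rank
            # factor projections: token vector -> (rank x heads) and (rank x head_dim)
            self.ak = nn.Linear(d_model, rank * n_head)
            self.bk = nn.Linear(d_model, rank * self.dh)
            self.av = nn.Linear(d_model, rank * n_head)
            self.bv = nn.Linear(d_model, rank * self.dh)
            self.q = nn.Linear(d_model, d_model)   # queries kept dense here for brevity
            self.o = nn.Linear(d_model, d_model)

        def forward(self, x):
            B, T, _ = x.shape
            # these four small tensors are all a KV cache would need to keep
            ak = self.ak(x).view(B, T, self.r, self.h)
            bk = self.bk(x).view(B, T, self.r, self.dh)
            av = self.av(x).view(B, T, self.r, self.h)
            bv = self.bv(x).view(B, T, self.r, self.dh)
            # reconstruct per-head K and V as averaged sums of outer products
            k = torch.einsum('btrh,btrd->bhtd', ak, bk) / self.r
            v = torch.einsum('btrh,btrd->bhtd', av, bv) / self.r
            q = self.q(x).view(B, T, self.h, self.dh).transpose(1, 2)
            y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
            return self.o(y.transpose(1, 2).reshape(B, T, -1))

    out = TPASketch()(torch.randn(2, 16, 512))   # -> shape (2, 16, 512)

In the full TPA method the queries are factorized as well; this sketch keeps them dense purely to stay short.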

Quick Start & Requirements

  • Install: Clone the repository and install dependencies using pip install torch==2.4.0 numpy transformers datasets tiktoken wandb tqdm.
  • Prerequisites: Python 3.10+, PyTorch 2.4.0.
  • Hardware: A100/H100 GPUs with at least 8x80GB VRAM recommended for pretraining.
  • Data: Supports Fineweb-Edu-100B and OpenWebText, with provided data preparation scripts (see the sketch after this list).
  • Docs: Webpage, Hugging Face
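As a rough illustration of what nanoGPT-style data preparation does (the repo ships its own scripts, which may differ in naming and details; the stand-in corpus and file name below are placeholders), text is tokenized with tiktoken's GPT-2 BPE and the token ids are written to a flat binary file that the training loop can memory-map:

    # Hypothetical nanoGPT-style data preparation sketch, not the repo's script.
    import numpy as np
    import tiktoken

    enc = tiktoken.get_encoding("gpt2")
    docs = ["example document one ...", "example document two ..."]  # stand-in corpus

    ids = []
    for text in docs:
        ids.extend(enc.encode_ordinary(text))   # BPE-encode, ignoring special tokens
        ids.append(enc.eot_token)               # end-of-text separator between documents

    # GPT-2's 50257-token vocabulary fits in uint16; training can np.memmap this file
    np.array(ids, dtype=np.uint16).tofile("train.bin")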

Highlighted Details

  • Implements Tensor Product Attention (TPA) for improved performance and reduced KV cache.
  • Scalable training procedures optimized for multi-GPU setups.
  • Flexible data support for datasets like Fineweb-Edu-100B and OpenWebText.
  • Integrated with lm-evaluation-harness for standardized benchmarking (see the evaluation sketch below).
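For the benchmarking integration, an evaluation run through lm-evaluation-harness's Python API might look like the sketch below; the checkpoint path and task list are placeholders, and the exact model wrapper T6 uses may differ from the generic Hugging Face one shown here:

    # Hypothetical evaluation sketch using EleutherAI's lm-evaluation-harness.
    # The checkpoint path and tasks are placeholders, not artifacts of this repo.
    import lm_eval

    results = lm_eval.simple_evaluate(
        model="hf",                                      # generic Hugging Face wrapper
        model_args="pretrained=/path/to/t6-checkpoint",  # placeholder checkpoint path
        tasks=["hellaswag", "arc_easy"],
        batch_size=8,
    )
    print(results["results"])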

Maintenance & Community

The project is maintained by authors from multiple institutions, including authors of the original TPA paper, and acknowledges nanoGPT, Hugging Face, and EleutherAI.

Licensing & Compatibility

The README does not state a license. Check the repository directly before relying on the code for commercial use or closed-source linking.

Limitations & Caveats

The README mentions "Higher-order TPA (TBD)" and "Flash TPA (TBD)", indicating these advanced features are under development. The recommended hardware (A100/H100 with 8x80GB VRAM) suggests significant resource requirements for pretraining.

Health Check

  • Last commit: 4 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 17 stars in the last 90 days

Explore Similar Projects

fms-fsdp by foundation-model-stack

  • Efficiently train foundation models with PyTorch
  • Top 0.4% on sourcepulse, 258 stars, created 1 year ago, updated 1 week ago
  • Starred by Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake).

InternEvo by InternLM

  • Lightweight training framework for model pre-training
  • Top 1.0% on sourcepulse, 402 stars, created 1 year ago, updated 1 week ago
  • Starred by Jeff Hammerbacher (Cofounder of Cloudera) and Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake).