TPA by tensorgi

Transformer model implementation for research

Created 8 months ago
385 stars

Top 74.2% on SourcePulse

View on GitHub
Project Summary

T6 is an open-source implementation of the Tensor Product Attention (TPA) Transformer, designed to improve performance and reduce KV cache size for large language models. It targets researchers and developers working on efficient and scalable transformer architectures.

How It Works

T6 uses Tensor Product Attention (TPA), a novel attention mechanism that factorizes queries, keys, and values into contextual low-rank tensor products. Because only the compact K and V factors need to be kept in the KV cache, the cache's memory footprint shrinks substantially while model quality is maintained or improved, enabling more efficient training and inference, particularly for large-scale models. The architecture is built on foundational code from nanoGPT, giving it a robust and familiar starting point.
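
To make the caching idea concrete, below is a minimal, self-contained PyTorch sketch of rank-R factorized keys and values. It is an illustration only, with assumed hyperparameters and layer names that are not the repository's actual code; per the TPA paper, the full method also factorizes queries and is designed to stay compatible with rotary position embeddings.

```python
# Illustrative sketch of tensor-product-factorized K/V caching.
# Hyperparameters and module names are hypothetical, not the repo's API.
import torch
import torch.nn.functional as F


class FactorizedKV(torch.nn.Module):
    """Emits rank-R factors for K and V; only the factors would be cached."""

    def __init__(self, d_model=512, n_heads=8, head_dim=64, rank=2):
        super().__init__()
        self.h, self.dh, self.r = n_heads, head_dim, rank
        # Each projection produces per-token factors instead of full K/V.
        self.k_head = torch.nn.Linear(d_model, rank * n_heads)   # a_K: (R, h)
        self.k_dim = torch.nn.Linear(d_model, rank * head_dim)   # b_K: (R, dh)
        self.v_head = torch.nn.Linear(d_model, rank * n_heads)   # a_V: (R, h)
        self.v_dim = torch.nn.Linear(d_model, rank * head_dim)   # b_V: (R, dh)

    def forward(self, x):                      # x: (B, T, d_model)
        B, T, _ = x.shape
        ak = self.k_head(x).view(B, T, self.r, self.h)
        bk = self.k_dim(x).view(B, T, self.r, self.dh)
        av = self.v_head(x).view(B, T, self.r, self.h)
        bv = self.v_dim(x).view(B, T, self.r, self.dh)
        # A KV cache would store (ak, bk, av, bv): 2*R*(h+dh) numbers per
        # token instead of 2*h*dh for full K and V (here: 288 vs 1024).
        return ak, bk, av, bv

    @staticmethod
    def materialize(a, b):
        # Rebuild full per-head tensors from the rank-R factors:
        # (B, T, h, dh) = average over rank of outer products a_r ⊗ b_r.
        return torch.einsum("btrh,btrd->bthd", a, b) / a.shape[2]


kv = FactorizedKV()
x = torch.randn(2, 16, 512)
ak, bk, av, bv = kv(x)
K = kv.materialize(ak, bk)                     # (2, 16, 8, 64)
V = kv.materialize(av, bv)
q = torch.randn(2, 8, 16, 64)                  # ordinary multi-head queries
out = F.scaled_dot_product_attention(q, K.transpose(1, 2), V.transpose(1, 2))
print(out.shape)                               # torch.Size([2, 8, 16, 64])
```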

Quick Start & Requirements

  • Install: Clone the repository and install dependencies using pip install torch==2.4.0 numpy transformers datasets tiktoken wandb tqdm.
  • Prerequisites: Python 3.10+, PyTorch 2.4.0.
  • Hardware: A100/H100 GPUs with at least 8x80GB VRAM recommended for pretraining.
  • Data: Supports Fineweb-Edu-100B and OpenWebText, with provided data preparation scripts.
  • Docs: Webpage, Huggingface
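
Before launching a long pretraining run, a quick sanity check of the stated prerequisites can save time. The snippet below is a generic illustration, not part of the repository:

```python
# Quick environment check against the stated prerequisites
# (Python 3.10+, PyTorch 2.4.0, multi-GPU with ~80 GB cards).
import sys
import torch

assert sys.version_info >= (3, 10), "Python 3.10+ is required"
print("torch", torch.__version__)              # expected: 2.4.0
print("CUDA available:", torch.cuda.is_available())
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 2**30:.0f} GiB")
```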

Highlighted Details

  • Implements Tensor Product Attention (TPA) for improved performance and reduced KV cache.
  • Scalable training procedures optimized for multi-GPU setups.
  • Flexible data support for datasets like Fineweb-Edu-100B and OpenWebText.
  • Integrated with lm-evaluation-harness for standardized benchmarking.
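
Because the project plugs into lm-evaluation-harness, a benchmark run could look roughly like the sketch below. The checkpoint id is a placeholder, and the repository's own evaluation scripts (or a custom model wrapper rather than the generic `hf` loader) may be required instead:

```python
# Hedged sketch: evaluating a trained checkpoint with lm-evaluation-harness.
# The checkpoint id is a placeholder; the repo's own eval entry point may differ.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                   # generic Hugging Face loader
    model_args="pretrained=<path-or-hub-id-of-T6-checkpoint>,dtype=bfloat16",
    tasks=["arc_easy", "hellaswag", "winogrande"],
    batch_size=8,
)
print(results["results"])
```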

Maintenance & Community

The project is associated with authors from multiple institutions, including those behind the original TPA paper. The README acknowledges nanoGPT, Hugging Face, and EleutherAI.

Licensing & Compatibility

The repository does not state a license in the provided README, so licensing terms should be clarified before commercial use or closed-source linking.

Limitations & Caveats

The README mentions "Higher-order TPA (TBD)" and "Flash TPA (TBD)", indicating these advanced features are under development. The recommended hardware (A100/H100 with 8x80GB VRAM) suggests significant resource requirements for pretraining.

Health Check

  • Last Commit: 3 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 5 stars in the last 30 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Jiayi Pan (Author of SWE-Gym; MTS at xAI), and 34 more.

flash-attention by Dao-AILab

  • 0.6% · 20k stars
  • Fast, memory-efficient attention implementation
  • Created 3 years ago · Updated 1 day ago
  • Starred by Tobi Lutke (Cofounder of Shopify), Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), and 36 more.

unsloth by unslothai

  • 0.6% · 46k stars
  • Finetuning tool for LLMs, targeting speed and memory efficiency
  • Created 1 year ago · Updated 14 hours ago