TokenFormer by Haiyang-W

Research paper on a fully attention-based neural network with tokenized model parameters

created 9 months ago
567 stars

Top 57.6% on sourcepulse

View on GitHub
Project Summary

TokenFormer introduces a novel, fully attention-based neural network architecture that tokenizes model parameters, enabling flexible and scalable Transformer designs. It targets researchers and practitioners seeking to enhance Transformer efficiency and adaptability, offering a unified approach to token-token and token-parameter interactions.

How It Works

TokenFormer reimagines the Transformer by treating model parameters as attendable tokens alongside the input data tokens. The attention mechanism then mediates interactions between data and parameters, enabling dynamic, data-dependent parameter updates. This design aims to maximize architectural flexibility: by varying the token types and their interactions, it can express diverse network families, including RNN-like structures (e.g., Mamba) and test-time-training (TTT) networks.
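To make the token-parameter interaction concrete, the sketch below implements a layer whose weights are a set of learnable key/value parameter tokens that the input tokens attend over. This is a minimal sketch rather than the repository's implementation: it uses a plain scaled softmax for score normalization (the paper describes a modified normalization), and all names and sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenParameterAttention(nn.Module):
    """Sketch: a layer whose weights are attendable parameter tokens."""

    def __init__(self, dim_in, dim_out, num_param_tokens):
        super().__init__()
        # Learnable key/value parameter tokens stand in for a fixed weight matrix.
        self.param_keys = nn.Parameter(torch.randn(num_param_tokens, dim_in) * 0.02)
        self.param_values = nn.Parameter(torch.randn(num_param_tokens, dim_out) * 0.02)

    def forward(self, x):
        # x: (batch, seq_len, dim_in) -- input data tokens act as queries.
        scores = x @ self.param_keys.t()                      # (batch, seq_len, num_param_tokens)
        weights = F.softmax(scores / x.shape[-1] ** 0.5, dim=-1)
        return weights @ self.param_values                    # (batch, seq_len, dim_out)

# Example: map 16 tokens of width 128 to width 256 via 64 parameter tokens.
layer = TokenParameterAttention(dim_in=128, dim_out=256, num_param_tokens=64)
out = layer(torch.randn(2, 16, 128))
print(out.shape)  # torch.Size([2, 16, 256])
```

Because the number of parameter tokens is independent of the input and output widths, a layer like this can be grown by appending parameter tokens rather than reshaping weight matrices, which is the property the scaling features below rely on.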

Quick Start & Requirements

  • Installation: Clone the repository, create a conda environment with Python 3.8, install PyTorch 2.2.1 with CUDA 12.1 support, and then install dependencies via pip install -r requirements/requirements.txt. Optional requirement files cover flash attention, wandb, tensorboard, comet, and apex.
  • Prerequisites: Python 3.8, CUDA 12.x, PyTorch 1.8+, a Rust toolchain (some dependencies may need cargo to build), and mpi4py (version 3.0.3 recommended). A quick sanity check is sketched after this list.
  • Resources: Evaluation has been tested on a single GPU. Training examples cover single-node (8-GPU) and multi-node (Slurm) setups.
  • Links: Project Page, HuggingFace Weights, arXiv.
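Before launching training, a quick environment check can confirm that the interpreter and CUDA build match the tested setup. This is an optional sanity check, not part of the repository; the expected versions are taken from the README's install steps and may differ on your machine.

```python
# Optional environment sanity check (expected values follow the README's tested setup).
import sys
import torch

print("python:", sys.version.split()[0])         # README targets Python 3.8
print("torch:", torch.__version__)               # install step uses PyTorch 2.2.1
print("cuda available:", torch.cuda.is_available())
print("cuda build:", torch.version.cuda)         # install step targets CUDA 12.1
```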

Highlighted Details

  • ICLR 2025 Spotlight presentation.
  • Native scalability through tokenized parameters.
  • Supports incremental model scaling, reducing training costs (a minimal sketch follows this list).
  • Pretrained models available for language modeling (150M to 1.5B parameters) on the Pile dataset.
  • Codebase is clean, concise, and relies on minimal dependencies.
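To make the incremental-scaling idea concrete, the sketch below grows a layer by appending new parameter tokens to its existing key/value sets. The zero initialization is an assumption chosen so the new tokens contribute nothing at first; whether the grown model exactly reproduces the smaller one depends on the score normalization used (the paper's modified normalization rather than the plain softmax of the earlier sketch), so treat this as illustrative rather than the repository's recipe.

```python
import torch
import torch.nn as nn

def grow_parameter_tokens(param_keys, param_values, num_new):
    """Illustrative incremental scaling: append zero-initialized parameter tokens.

    param_keys:   (num_param_tokens, dim_in)  existing key parameter tokens
    param_values: (num_param_tokens, dim_out) existing value parameter tokens
    """
    new_keys = torch.zeros(num_new, param_keys.shape[1])
    new_values = torch.zeros(num_new, param_values.shape[1])
    grown_keys = nn.Parameter(torch.cat([param_keys.detach(), new_keys], dim=0))
    grown_values = nn.Parameter(torch.cat([param_values.detach(), new_values], dim=0))
    return grown_keys, grown_values

# Example: grow a layer from 64 to 96 parameter tokens.
keys, values = nn.Parameter(torch.randn(64, 128)), nn.Parameter(torch.randn(64, 256))
keys, values = grow_parameter_tokens(keys, values, num_new=32)
print(keys.shape, values.shape)  # torch.Size([96, 128]) torch.Size([96, 256])
```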

Maintenance & Community

The project is led by Haiyang Wang and Bernt Schiele. News and updates are shared via GitHub releases.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The codebase was developed and tested on Python 3.8; compatibility with newer Python versions may be limited by its dependencies. The authors note that the training code was released after only limited testing, so issues may remain. Visual modeling benchmarks are slated for a later release.

Health Check

  • Last commit: 5 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

12 stars in the last 90 days

Explore Similar Projects

Starred by Jeremy Howard (Cofounder of fast.ai) and Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake).

SwissArmyTransformer by THUDM

0.3% · 1k stars
Transformer library for flexible model development
created 3 years ago · updated 7 months ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), George Hotz (Author of tinygrad; Founder of the tiny corp, comma.ai), and 10 more.

TinyLlama by jzhang38

0.3% · 9k stars
Tiny pretraining project for a 1.1B Llama model
created 1 year ago · updated 1 year ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems) and Jiayi Pan (Author of SWE-Gym; AI Researcher at UC Berkeley).

DeepSeek-Coder-V2 by deepseek-ai

0.4% · 6k stars
Open-source code language model comparable to GPT4-Turbo
created 1 year ago · updated 10 months ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Jeff Hammerbacher (Cofounder of Cloudera), and 10 more.

open-r1 by huggingface

0.2% · 25k stars
SDK for reproducing DeepSeek-R1
created 6 months ago · updated 3 days ago