reformer-pytorch by lucidrains

PyTorch implementation of the Reformer, the efficient Transformer architecture from the research paper "Reformer: The Efficient Transformer"

created 5 years ago
2,177 stars

Top 21.1% on sourcepulse

Project Summary

This repository provides a PyTorch implementation of the Reformer model, an efficient Transformer architecture designed for handling long sequences with reduced memory and computational costs. It is suitable for researchers and practitioners working with large-scale sequence modeling tasks, offering significant memory savings over standard Transformers.

How It Works

The core innovation is Locality-Sensitive Hashing (LSH) attention, which approximates full attention by hashing queries and keys into buckets so that attention is computed only within each bucket. This is combined with reversible layers, which reduce memory usage by recomputing activations during the backward pass instead of storing them, and with chunked feedforward and attention computation to further limit peak memory.
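As an illustration, the library exposes the LSH attention module on its own. The following is a minimal sketch based on the usage documented in the README; parameter names such as bucket_size and n_hashes follow the project's examples and may differ across versions.

    import torch
    from reformer_pytorch import LSHSelfAttention

    # Standalone LSH self-attention layer: queries/keys are hashed into
    # buckets, and attention is computed only within each bucket.
    attn = LSHSelfAttention(
        dim = 128,
        heads = 8,
        bucket_size = 64,   # average number of queries/keys per bucket
        n_hashes = 8,       # hash rounds; more rounds give a better approximation
        causal = False
    )

    x = torch.randn(10, 1024, 128)  # (batch, seq_len, dim); seq_len divisible by bucket_size * 2
    y = attn(x)                     # (10, 1024, 128)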

Quick Start & Requirements

  • Install via pip: pip install reformer_pytorch (a minimal usage sketch follows this list)
  • Requires PyTorch; a CUDA-capable GPU is recommended for performance.
  • Official documentation and examples are available in the README.
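A minimal language-model sketch, assuming the ReformerLM interface documented in the README; the hyperparameter values here are illustrative, not recommendations.

    import torch
    from reformer_pytorch import ReformerLM

    model = ReformerLM(
        num_tokens = 20000,   # vocabulary size
        dim = 512,
        depth = 6,
        max_seq_len = 8192,
        heads = 8,
        lsh_dropout = 0.1,
        causal = True,        # autoregressive language modeling
        bucket_size = 64,
        n_hashes = 4
    )

    x = torch.randint(0, 20000, (1, 8192))  # token ids, length divisible by bucket_size * 2
    logits = model(x)                       # (1, 8192, 20000)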

Highlighted Details

  • Implements LSH attention, reversible layers, and chunking for efficiency.
  • Supports various positional embeddings (rotary, axial, absolute).
  • Includes optional features like Product Key Memory (PKM), GLU feedforward, and layer dropout.
  • Offers a ReformerEncDec wrapper for encoder-decoder architectures (see the sketch after this list).
  • Compatible with Microsoft's DeepSpeed for distributed training.
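The encoder-decoder wrapper can be sketched roughly as follows. The enc_/dec_ prefixed keyword arguments follow the convention shown in the README's example, but exact names, defaults, and the return_loss training path should be checked against the current release.

    import torch
    from reformer_pytorch import ReformerEncDec

    enc_dec = ReformerEncDec(
        dim = 512,
        enc_num_tokens = 20000,
        enc_depth = 6,
        enc_max_seq_len = 4096,
        dec_num_tokens = 20000,
        dec_depth = 6,
        dec_max_seq_len = 4096
    )

    src = torch.randint(0, 20000, (1, 4096))  # encoder input tokens
    tgt = torch.randint(0, 20000, (1, 4096))  # decoder target tokens

    loss = enc_dec(src, tgt, return_loss = True)  # training step on the returned loss
    loss.backward()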

Maintenance & Community

The project is maintained by lucidrains, with contributions from various individuals, although the most recent commit dates back about two years (see the health check below). Links to community channels or a roadmap are not provided in the README.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README. Users should verify licensing for commercial or closed-source use.

Limitations & Caveats

The README notes potential instability with the O2 optimization level during mixed-precision training and recommends O1 instead. It also notes that sequence lengths must be divisible by bucket_size * 2; an Autopadder helper is provided to handle the required padding automatically (sketched below).
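A sketch of the Autopadder wrapper mentioned above, following the README's example: it pads inputs up to the next multiple of bucket_size * 2 so arbitrary sequence lengths can be fed in. Details (e.g. mask handling) may vary by version.

    import torch
    from reformer_pytorch import ReformerLM, Autopadder

    model = ReformerLM(
        num_tokens = 20000,
        dim = 512,
        depth = 6,
        max_seq_len = 8192,
        heads = 8,
        causal = True,
        bucket_size = 64
    )
    model = Autopadder(model)  # pads the input to a multiple of bucket_size * 2

    x = torch.randint(0, 20000, (1, 7777))  # length not divisible by bucket_size * 2
    logits = model(x)                       # (1, 7777, 20000), trimmed back to the original length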

Health Check

  • Last commit: 2 years ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 16 stars in the last 90 days
