bigbird by google-research

A sparse-attention Transformer that extends BERT-like models to longer sequences

created 4 years ago
618 stars

Top 54.2% on sourcepulse

Project Summary

BigBird is a sparse-attention-based Transformer model designed to extend BERT-like models to significantly longer sequences. It targets NLP researchers and practitioners working on tasks such as question answering and summarization, offering improved performance and reduced memory consumption compared to standard full-attention Transformers.

How It Works

BigBird replaces full self-attention with a block-sparse attention pattern that combines three components: local (sliding-window) attention, a small number of global tokens that attend to and are attended by every position, and a handful of random connections per query. Because the number of attended positions per token stays roughly constant, memory and compute grow roughly linearly with sequence length rather than quadratically, and the paper argues that this pattern theoretically retains the expressive power of full attention, unlike some other sparse schemes that can drop parts of the context.
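
The sketch below is a minimal, illustrative way to picture that pattern: it builds the local, global, and random components as a dense boolean mask in NumPy. All sizes are made up, and the repository's block_sparse implementation works block-wise without ever materializing the full n x n matrix; this is only a visualization aid.

```python
# Illustrative sketch of a BigBird-style sparsity pattern (local + global +
# random attention) as a dense boolean mask. NOT the repository's code.
import numpy as np

def bigbird_style_mask(seq_len, window=3, num_global=2, num_random=2, seed=0):
    rng = np.random.default_rng(seed)
    mask = np.zeros((seq_len, seq_len), dtype=bool)

    # Local (sliding-window) attention: each token sees +/- `window` neighbours.
    for i in range(seq_len):
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        mask[i, lo:hi] = True

    # Global attention: the first `num_global` tokens attend everywhere
    # and are attended to by every position (e.g. [CLS]-like tokens).
    mask[:num_global, :] = True
    mask[:, :num_global] = True

    # Random attention: each token additionally sees a few random positions.
    for i in range(seq_len):
        mask[i, rng.choice(seq_len, size=num_random, replace=False)] = True

    return mask

if __name__ == "__main__":
    m = bigbird_style_mask(seq_len=16)
    # Attended positions per row stay roughly constant, so cost grows ~linearly.
    print("attended positions per query:", m.sum(axis=1))
```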

Quick Start & Requirements

  • Clone the repository, then install via pip: pip3 install -e . from the repo root.
  • Requires TensorFlow 2.3.1; TPUs are recommended for best performance (a GCP TPU setup is demonstrated). A quick environment check is sketched after this list.
  • A quick fine-tuning demonstration is available in imdb.ipynb.
  • Pretrained checkpoints are available on Google Cloud Storage.
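
The snippet below only inspects the local environment against the requirements above. The TensorFlow 2.3.1 pin and the TPU preference come from the repository; everything else is an optional convenience, not part of the project.

```python
# Optional environment check for the requirements listed above.
import tensorflow as tf

print("TensorFlow version:", tf.__version__)
if not tf.__version__.startswith("2.3"):
    print("Warning: the reference setup pins TensorFlow 2.3.1.")
# Lists CPU/GPU/TPU devices visible to TensorFlow.
print("Available devices:", [d.device_type for d in tf.config.list_logical_devices()])
```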

Highlighted Details

  • Extends Transformer models to handle sequences longer than 1024 tokens.
  • Achieves significant performance improvements on tasks like question answering and summarization.
  • Reduces memory consumption compared to standard Transformers without sacrificing performance, as shown in Long Range Arena benchmarks.
  • Supports three attention types: original_full, simulated_sparse, and block_sparse (contrasted in the sketch after this list).
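
One way to understand two of these modes: simulated_sparse can be thought of as computing the full attention-score matrix and masking it down to the sparse pattern, while block_sparse gathers only the needed key/value blocks so the full n x n matrix is never built, which is where the memory savings come from. The NumPy function below is illustrative only and is not the repository's API.

```python
# Conceptual sketch of "simulated" sparse attention: full scores + mask.
# The real block_sparse kernels avoid building the n x n score matrix.
import numpy as np

def simulated_sparse_attention(q, k, v, mask):
    # q, k, v: (seq_len, dim); mask: (seq_len, seq_len) boolean sparsity pattern.
    scores = q @ k.T / np.sqrt(q.shape[-1])        # full n x n scores
    scores = np.where(mask, scores, -1e9)          # drop disallowed pairs
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

if __name__ == "__main__":
    n, d = 16, 8
    rng = np.random.default_rng(0)
    q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
    # Any sparse pattern works here; keep the diagonal so every row attends somewhere.
    mask = np.eye(n, dtype=bool) | (rng.random((n, n)) < 0.2)
    out = simulated_sparse_attention(q, k, v, mask)
    print(out.shape)  # (16, 8)
```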

Maintenance & Community

  • This is not an official Google product.
  • Citation is provided for the NeurIPS 2020 paper.

Licensing & Compatibility

  • The repository does not explicitly state a license in the README.

Limitations & Caveats

  • The code currently handles only tensors of static shape and is designed primarily for TPUs.
  • For sequence lengths below 1024, original_full attention is advised, since BigBird's sparse attention offers no benefit at that scale (a rule-of-thumb helper is sketched after this list).
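
A hypothetical helper, not part of the repository, that encodes the second point as a simple rule of thumb; the 1024 threshold and the attention-type strings mirror the values mentioned above.

```python
# Rule of thumb from the caveat above: short inputs gain nothing from sparsity.
def choose_attention_type(seq_len: int, threshold: int = 1024) -> str:
    return "original_full" if seq_len < threshold else "block_sparse"

assert choose_attention_type(512) == "original_full"
assert choose_attention_type(4096) == "block_sparse"
```
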
Health Check

  • Last commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 12 stars in the last 90 days