bigbird by google-research

A sparse-attention Transformer that extends BERT-like models to longer sequences

created 4 years ago
618 stars

Top 54.2% on sourcepulse

Project Summary

BigBird is a sparse-attention-based Transformer model designed to extend BERT-like models to significantly longer sequences. It targets NLP researchers and practitioners working on tasks such as question answering and summarization, offering improved performance and reduced memory consumption compared to standard full-attention Transformers.

How It Works

BigBird replaces full self-attention with a block-sparse attention pattern that combines three components: local (sliding-window) attention, a small number of global tokens that attend to and are attended by every position, and a handful of random connections per query. Because the number of attended positions per token stays roughly constant, memory and compute grow roughly linearly with sequence length rather than quadratically, and the paper argues that this pattern theoretically retains the expressive power of full attention, unlike some other sparse schemes that can drop parts of the context.
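
The sketch below is a minimal, illustrative way to picture that pattern: it builds the local, global, and random components as a dense boolean mask in NumPy. All sizes are made up, and the repository's block_sparse implementation works block-wise without ever materializing the full n x n matrix; this is only a visualization aid.

```python
# Illustrative sketch of a BigBird-style sparsity pattern (local + global +
# random attention) as a dense boolean mask. NOT the repository's code.
import numpy as np

def bigbird_style_mask(seq_len, window=3, num_global=2, num_random=2, seed=0):
    rng = np.random.default_rng(seed)
    mask = np.zeros((seq_len, seq_len), dtype=bool)

    # Local (sliding-window) attention: each token sees +/- `window` neighbours.
    for i in range(seq_len):
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        mask[i, lo:hi] = True

    # Global attention: the first `num_global` tokens attend everywhere
    # and are attended to by every position (e.g. [CLS]-like tokens).
    mask[:num_global, :] = True
    mask[:, :num_global] = True

    # Random attention: each token additionally sees a few random positions.
    for i in range(seq_len):
        mask[i, rng.choice(seq_len, size=num_random, replace=False)] = True

    return mask

if __name__ == "__main__":
    m = bigbird_style_mask(seq_len=16)
    # Attended positions per row stay roughly constant, so cost grows ~linearly.
    print("attended positions per query:", m.sum(axis=1))
```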

Quick Start & Requirements

  • Clone the repository, then install via pip: pip3 install -e . from the repo root.
  • Requires TensorFlow 2.3.1; TPUs are recommended for best performance (a GCP TPU setup is demonstrated). A quick environment check is sketched after this list.
  • A quick fine-tuning demonstration is available in imdb.ipynb.
  • Pretrained checkpoints are available on Google Cloud Storage.
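
The snippet below only inspects the local environment against the requirements above. The TensorFlow 2.3.1 pin and the TPU preference come from the repository; everything else is an optional convenience, not part of the project.

```python
# Optional environment check for the requirements listed above.
import tensorflow as tf

print("TensorFlow version:", tf.__version__)
if not tf.__version__.startswith("2.3"):
    print("Warning: the reference setup pins TensorFlow 2.3.1.")
# Lists CPU/GPU/TPU devices visible to TensorFlow.
print("Available devices:", [d.device_type for d in tf.config.list_logical_devices()])
```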

Highlighted Details

  • Extends Transformer models to handle sequences longer than 1024 tokens.
  • Achieves significant performance improvements on tasks like question answering and summarization.
  • Reduces memory consumption compared to standard Transformers without sacrificing performance, as shown in Long Range Arena benchmarks.
  • Supports three attention types: original_full, simulated_sparse, and block_sparse (contrasted in the sketch after this list).
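
One way to understand two of these modes: simulated_sparse can be thought of as computing the full attention-score matrix and masking it down to the sparse pattern, while block_sparse gathers only the needed key/value blocks so the full n x n matrix is never built, which is where the memory savings come from. The NumPy function below is illustrative only and is not the repository's API.

```python
# Conceptual sketch of "simulated" sparse attention: full scores + mask.
# The real block_sparse kernels avoid building the n x n score matrix.
import numpy as np

def simulated_sparse_attention(q, k, v, mask):
    # q, k, v: (seq_len, dim); mask: (seq_len, seq_len) boolean sparsity pattern.
    scores = q @ k.T / np.sqrt(q.shape[-1])        # full n x n scores
    scores = np.where(mask, scores, -1e9)          # drop disallowed pairs
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

if __name__ == "__main__":
    n, d = 16, 8
    rng = np.random.default_rng(0)
    q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
    # Any sparse pattern works here; keep the diagonal so every row attends somewhere.
    mask = np.eye(n, dtype=bool) | (rng.random((n, n)) < 0.2)
    out = simulated_sparse_attention(q, k, v, mask)
    print(out.shape)  # (16, 8)
```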

Maintenance & Community

  • This is not an official Google product.
  • Citation is provided for the NeurIPS 2020 paper.

Licensing & Compatibility

  • The repository does not explicitly state a license in the README.

Limitations & Caveats

  • The code currently handles only tensors of static shape and is designed primarily for TPUs.
  • For sequence lengths below 1024, original_full attention is advised, since BigBird's sparse attention offers no benefit at that scale (a rule-of-thumb helper is sketched after this list).
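
A hypothetical helper, not part of the repository, that encodes the second point as a simple rule of thumb; the 1024 threshold and the attention-type strings mirror the values mentioned above.

```python
# Rule of thumb from the caveat above: short inputs gain nothing from sparsity.
def choose_attention_type(seq_len: int, threshold: int = 1024) -> str:
    return "original_full" if seq_len < threshold else "block_sparse"

assert choose_attention_type(512) == "original_full"
assert choose_attention_type(4096) == "block_sparse"
```
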
Health Check

  • Last commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 12 stars in the last 90 days