Bort, by Alexa

Companion code for the research paper on BERT subarchitecture extraction

Created 4 years ago · 473 stars


Project Summary

Bort provides an optimal subarchitecture for BERT, significantly reducing its size and computational requirements. This is achieved using a fully polynomial-time approximation scheme (FPTAS) for neural architecture search, making it suitable for researchers and practitioners seeking efficient NLP models.

How It Works

Bort extracts an optimal subset of BERT's architectural parameters, yielding a model that is 5.5% of the effective size of BERT-large (16% of its net size). The search uses an FPTAS to explore the space of subarchitectures efficiently, and the resulting model offers substantial inference speedups (7.9x on CPU versus BERT-base) along with a pre-training time of roughly 1.2% of that of RoBERTa-large.
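The net-size figure is easy to sanity-check from the parameter counts quoted in this summary (56M for Bort) and the commonly cited ~340M parameters for BERT-large; a minimal check:

```python
# Sanity-check the quoted size ratio from parameter counts.
# 56M for Bort is quoted below; ~340M for BERT-large is the commonly
# cited figure (an assumption, not stated in this summary).
bort_params = 56_000_000
bert_large_params = 340_000_000

net_ratio = bort_params / bert_large_params
print(f"Bort is {net_ratio:.1%} of BERT-large's net size")  # ~16.5%
```

The effective-size figure (5.5%) excludes the embedding layer, which is why it is so much smaller than the net-size ratio.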

Quick Start & Requirements

  • Install dependencies: pip install -r requirements.txt
  • Tested with Python 3.6.5+.
  • Pre-training requires Horovod installed from source with MXNet and CUDA 10.1 support.
  • Download pre-trained model: aws s3 cp s3://alexa-saif-bort/bort.params model/
  • Download sample text for testing: wget https://github.com/dmlc/gluon-nlp/blob/v0.9.x/scripts/bert/sample_text.txt
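The steps above can be collected into a single setup script (a sketch; it assumes the AWS CLI is configured, the repository's requirements.txt is present, and — for pre-training — a CUDA 10.1 + MXNet environment with Horovod built from source):

```shell
# Install Python dependencies (Python 3.6.5+ per the README)
pip install -r requirements.txt

# Fetch the pre-trained parameters into a local model/ directory
# (mkdir is an assumption; the README only shows the copy command)
mkdir -p model
aws s3 cp s3://alexa-saif-bort/bort.params model/

# Grab sample text for a quick smoke test
wget https://github.com/dmlc/gluon-nlp/blob/v0.9.x/scripts/bert/sample_text.txt
```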

Highlighted Details

  • Achieves absolute performance improvements of 0.3% to 31% over BERT-large on NLU benchmarks.
  • Bort has 56M parameters, 4 layers, 8 attention heads, and a hidden size of 1024.
  • Offers significant speedups on CPU and reduced pre-training time.
  • Supports GLUE, SuperGLUE, and RACE datasets with specific data preparation steps.
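The architecture in the second bullet can be written down as a plain config (the dict keys are illustrative, in the style of common BERT-style configs; the values are the ones quoted above):

```python
# Bort's published hyperparameters; key names are hypothetical,
# mirroring typical BERT config files.
bort_config = {
    "num_hidden_layers": 4,
    "num_attention_heads": 8,
    "hidden_size": 1024,
}

# Each attention head therefore operates on a 1024 / 8 = 128-dim slice.
head_dim = bort_config["hidden_size"] // bort_config["num_attention_heads"]
print(head_dim)  # 128
```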

Maintenance & Community

The project is associated with research papers by Adrian de Wynter and Daniel J. Perry. No specific community channels or active maintenance signals are mentioned in the README.

Licensing & Compatibility

  • Licensed under the Apache-2.0 License.
  • Compatible with commercial use and closed-source linking.

Limitations & Caveats

Fine-tuning may yield odd results without an implementation of the Agora algorithm, which is referenced but not included. Out-of-memory errors can occur with large batch sizes or sequence lengths; reducing sequence length is recommended.
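The memory caveat can be handled mechanically. A minimal sketch of a halve-on-OOM retry loop (`run_step` is a hypothetical training-step callable; Bort's own scripts instead expose batch size and maximum sequence length as command-line flags):

```python
def fit_with_backoff(run_step, batch_size, min_batch=1):
    """Retry a training step, halving the batch size after an
    out-of-memory error (hypothetical helper, not part of Bort)."""
    while batch_size >= min_batch:
        try:
            # Succeeds once the batch fits in memory.
            return run_step(batch_size), batch_size
        except MemoryError:
            batch_size //= 2
    raise MemoryError("could not fit even the minimum batch size")
```

Reducing the maximum sequence length works the same way and usually frees more memory per step, since attention cost grows quadratically with sequence length.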

Health Check

  • Last commit: 3 years ago
  • Responsiveness: inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 1 star in the last 90 days
