bert-nmt by bert-nmt

Training script for BERT-fused Neural Machine Translation (NMT)

created 5 years ago
362 stars

Top 78.7% on sourcepulse

Project Summary

This repository provides code for BERT-fused Neural Machine Translation (NMT), enhancing translation quality by integrating BERT embeddings. It's targeted at researchers and practitioners in NLP and machine translation looking to leverage large pre-trained language models for improved NMT performance.

How It Works

The approach fuses BERT into a standard Transformer NMT architecture. The source sentence is first encoded by BERT, and the NMT encoder and decoder layers then attend to BERT's contextual representations through an additional attention module alongside their usual attention, allowing the NMT model to benefit from BERT's deep linguistic knowledge. This fusion aims to capture richer semantic information than a conventional NMT model, leading to more accurate translations.
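
A minimal sketch of the fusion, assuming the attention-based variant described in the paper; class and parameter names below are illustrative, not the repository's actual fairseq modules. Each encoder layer attends both to its own states (self-attention) and to the frozen BERT output (BERT-attention), and the two results are averaged before the feed-forward sub-layer.

    import torch.nn as nn

    class BertFusedEncoderLayer(nn.Module):
        """Illustrative Transformer encoder layer with an extra BERT-attention path."""

        def __init__(self, d_model=512, d_bert=768, nhead=8, dim_ff=2048, dropout=0.1):
            super().__init__()
            self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
            # kdim/vdim let the layer attend to BERT states of a different width.
            self.bert_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout,
                                                   kdim=d_bert, vdim=d_bert)
            self.ffn = nn.Sequential(nn.Linear(d_model, dim_ff), nn.ReLU(),
                                     nn.Linear(dim_ff, d_model))
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)
            self.drop = nn.Dropout(dropout)

        def forward(self, x, bert_out):
            # x: (src_len, batch, d_model); bert_out: (bert_len, batch, d_bert), BERT stays frozen.
            h_self, _ = self.self_attn(x, x, x)
            h_bert, _ = self.bert_attn(x, bert_out, bert_out)
            x = self.norm1(x + self.drop(0.5 * (h_self + h_bert)))  # average the two attention paths
            x = self.norm2(x + self.drop(self.ffn(x)))
            return x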

Quick Start & Requirements

  • Installation: pip install --editable . after cloning the repository.
  • Prerequisites: PyTorch version 1.0.0/1.1.0, Python version >= 3.5. Requires Fairseq for data preprocessing and baseline NMT training.
  • Data: Requires tokenized and BPE-encoded data files, prepared using Fairseq's prepare-xxx.sh and a custom makedataforbert.sh script (the BERT-side step is sketched after this list).
  • Resources: Training involves standard NMT resource requirements, potentially higher due to BERT integration.
  • Links: Fairseq for baseline NMT.
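
As a rough illustration of the BERT-side preprocessing mentioned in the Data item (an assumption about what makedataforbert.sh produces, with hypothetical file names), each source sentence is re-tokenized with the matching Hugging Face BERT tokenizer into a parallel file that the BERT encoder reads at training time:

    from transformers import BertTokenizer

    # Hypothetical file names; the repository's makedataforbert.sh works on the fairseq-prepared splits.
    tokenizer = BertTokenizer.from_pretrained("bert-base-german-dbmdz-uncased")

    with open("train.de") as src, open("train.bert.de", "w") as out:
        for line in src:
            # Re-tokenize the raw source sentence with BERT's WordPiece vocabulary.
            pieces = tokenizer.tokenize(line.strip())
            out.write(" ".join(pieces) + "\n")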

Highlighted Details

  • Achieved 37.34 BLEU on IWSLT'14 de->en using bert-base-german-dbmdz-uncased.
  • Supports fine-tuning with a pre-trained vanilla NMT model (--warmup-from-nmt).
  • Implements an encoder dropout technique (--encoder-bert-dropout) for regularization; one reading of it is sketched after this list.
  • Compatible with Hugging Face's transformers library for various BERT models.
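
The --encoder-bert-dropout flag corresponds to the paper's drop-net style regularization. A plausible reading is sketched below (an assumption, not the repository's exact implementation): during training each layer randomly keeps only the self-attention output, only the BERT-attention output, or the average of both; at inference it always averages.

    import random

    def drop_net_mix(h_self, h_bert, p=0.5, training=True):
        # Drop-net style mixing of the two attention outputs (illustrative, not the repo's code).
        # With probability p/2 keep only one path, with p/2 only the other, otherwise average.
        if not training or p == 0.0:
            return 0.5 * (h_self + h_bert)      # inference: plain average
        u = random.random()
        if u < p / 2:
            return h_self                       # drop the BERT-attention path this step
        if u < p:
            return h_bert                       # drop the self-attention path this step
        return 0.5 * (h_self + h_bert)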

Maintenance & Community

The project is associated with the ICLR 2020 paper "Incorporating BERT into Neural Machine Translation." No specific community channels or active maintenance signals are evident from the README.

Licensing & Compatibility

The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The code requires specific older versions of PyTorch (1.0.0/1.1.0), which may pose compatibility challenges with current environments. The data preparation steps involve custom scripts beyond standard Fairseq.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 90 days
