awesome-align by neulab

Neural word aligner for multilingual BERT models

created 4 years ago
354 stars

Top 80.0% on sourcepulse

Project Summary

Awesome-align provides a neural approach to word alignment using multilingual BERT (mBERT). It targets NLP researchers and practitioners needing to extract word alignments from parallel corpora, offering improved quality over traditional statistical methods and enabling fine-tuning for specific language pairs.

How It Works

Awesome-align infers word alignments from mBERT's contextualized embeddings. In the default 'softmax' extraction method, it computes a similarity matrix between source and target token representations, normalizes it with a softmax in both directions, and keeps token pairs whose scores exceed a threshold in both. The tool also supports fine-tuning mBERT on parallel data with objectives such as masked language modeling (MLM), translation language modeling (TLM), and a self-training objective (SO) to further improve alignment quality.
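The bidirectional softmax extraction described above can be sketched as follows. This is a minimal illustration over precomputed embedding matrices, not the library's actual implementation; the function name, inputs, and threshold value are hypothetical.

```python
import numpy as np

def extract_alignments(src_emb, tgt_emb, threshold=1e-3):
    """Toy sketch of softmax-based alignment extraction.

    src_emb: (m, d) array of source token embeddings
    tgt_emb: (n, d) array of target token embeddings
    Returns the set of (i, j) pairs whose softmax score exceeds the
    threshold in BOTH the source-to-target and target-to-source directions.
    """
    sim = src_emb @ tgt_emb.T                              # (m, n) similarity matrix
    # Softmax over target tokens for each source token ...
    p_s2t = np.exp(sim) / np.exp(sim).sum(axis=1, keepdims=True)
    # ... and over source tokens for each target token.
    p_t2s = np.exp(sim) / np.exp(sim).sum(axis=0, keepdims=True)
    keep = (p_s2t > threshold) & (p_t2s > threshold)
    return {(int(i), int(j)) for i, j in zip(*np.nonzero(keep))}
```

Requiring a pair to survive in both directions acts like the intersection heuristic from statistical alignment, filtering out one-sided matches.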

Quick Start & Requirements

  • Install from source: pip install -r requirements.txt followed by python setup.py install.
  • Requires Python and PyTorch.
  • GPU with CUDA is recommended for performance.
  • Input data format: tokenized source and target sentences separated by |||.
  • Official demo and examples are available.
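The ||| input format mentioned above is simple to produce programmatically. A minimal sketch; the file name, sentence pairs, and helper function are illustrative, not part of the tool.

```python
# One tokenized sentence pair per line: source tokens ||| target tokens.
pairs = [
    ("Das ist ein Test .", "This is a test ."),
    ("Guten Morgen !", "Good morning !"),
]
with open("parallel.src-tgt", "w", encoding="utf-8") as f:
    for src, tgt in pairs:
        f.write(f"{src} ||| {tgt}\n")

def read_parallel(path):
    """Read (source, target) tuples back from a |||-separated file."""
    with open(path, encoding="utf-8") as f:
        return [tuple(line.rstrip("\n").split(" ||| ")) for line in f]
```

Note that both sides should already be tokenized (punctuation split off as separate tokens), since alignments are produced over whitespace-separated tokens.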

Highlighted Details

  • Achieves state-of-the-art alignment error rates (AER) on multiple language pairs, outperforming statistical aligners like fast_align and Mgiza.
  • Offers fine-tuning capabilities for improved performance on specific parallel corpora.
  • Supports extraction of alignment probabilities and word pairs.
  • Can incorporate supervised gold alignments for enhanced training.
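Word-pair outputs and AER scores like those cited above are conventionally handled in the Pharaoh `i-j` format. A small sketch of parsing and scoring; the helper names are illustrative, and the AER formula follows the standard definition, assuming the possible set includes the sure set.

```python
def parse_pharaoh(line):
    """Parse one line of space-separated 'i-j' pairs into a set of tuples."""
    return {tuple(map(int, pair.split("-"))) for pair in line.split()}

def aer(sure, possible, hyp):
    """Alignment Error Rate: 1 - (|A∩S| + |A∩P|) / (|A| + |S|).

    Lower is better; 0.0 means the hypothesis exactly covers the sure set.
    """
    return 1.0 - (len(hyp & sure) + len(hyp & possible)) / (len(hyp) + len(sure))
```

For example, `parse_pharaoh("0-0 1-2 2-1")` yields the pairs `{(0, 0), (1, 2), (2, 1)}`, and scoring a hypothesis identical to the gold alignments gives an AER of 0.0.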

Maintenance & Community

The project comes from NeuLab, Carnegie Mellon University's neural language technologies group. Parts of the code are borrowed from HuggingFace Transformers (Apache 2.0).

Licensing & Compatibility

Code is licensed under Apache 2.0, allowing for commercial use and integration with closed-source projects.

Limitations & Caveats

The README does not specify hardware requirements beyond GPU recommendations, nor does it detail the expected setup time for fine-tuning. Performance claims are based on specific datasets and may vary.

Health Check

  • Last commit: 3 years ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 9 stars in the last 90 days

Explore Similar Projects

Starred by Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), Travis Fischer (Founder of Agentic), and 5 more.

setfit by huggingface — Top 0.2%, 3k stars

Few-shot learning framework for Sentence Transformers. Created 3 years ago, updated 3 months ago.