awesome-align by neulab

Neural word aligner for multilingual BERT models

created 4 years ago
354 stars

Top 80.0% on sourcepulse

Project Summary

Awesome-align provides a neural approach to word alignment using multilingual BERT (mBERT). It targets NLP researchers and practitioners needing to extract word alignments from parallel corpora, offering improved quality over traditional statistical methods and enabling fine-tuning for specific language pairs.

How It Works

Awesome-align infers word alignments from mBERT's contextualized embeddings. In the default 'softmax' extraction method, it computes a similarity matrix between source and target token representations, normalizes it with a softmax in both directions, and keeps token pairs whose scores exceed a threshold in both. The tool also supports fine-tuning mBERT on parallel data with objectives such as masked language modeling (MLM), translation language modeling (TLM), and a self-training objective (SO) to further improve alignment quality.
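The bidirectional softmax extraction described above can be sketched as follows. This is a minimal illustration over precomputed embedding matrices, not the library's actual implementation; the function name, inputs, and threshold value are hypothetical.

```python
import numpy as np

def extract_alignments(src_emb, tgt_emb, threshold=1e-3):
    """Toy sketch of softmax-based alignment extraction.

    src_emb: (m, d) array of source token embeddings
    tgt_emb: (n, d) array of target token embeddings
    Returns the set of (i, j) pairs whose softmax score exceeds the
    threshold in BOTH the source-to-target and target-to-source directions.
    """
    sim = src_emb @ tgt_emb.T                              # (m, n) similarity matrix
    # Softmax over target tokens for each source token ...
    p_s2t = np.exp(sim) / np.exp(sim).sum(axis=1, keepdims=True)
    # ... and over source tokens for each target token.
    p_t2s = np.exp(sim) / np.exp(sim).sum(axis=0, keepdims=True)
    keep = (p_s2t > threshold) & (p_t2s > threshold)
    return {(int(i), int(j)) for i, j in zip(*np.nonzero(keep))}
```

Requiring a pair to survive in both directions acts like the intersection heuristic from statistical alignment, filtering out one-sided matches.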

Quick Start & Requirements

  • Install from source: pip install -r requirements.txt followed by python setup.py install.
  • Requires Python and PyTorch.
  • GPU with CUDA is recommended for performance.
  • Input data format: tokenized source and target sentences separated by |||.
  • Official demo and examples are available.
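The ||| input format mentioned above is simple to produce programmatically. A minimal sketch; the file name, sentence pairs, and helper function are illustrative, not part of the tool.

```python
# One tokenized sentence pair per line: source tokens ||| target tokens.
pairs = [
    ("Das ist ein Test .", "This is a test ."),
    ("Guten Morgen !", "Good morning !"),
]
with open("parallel.src-tgt", "w", encoding="utf-8") as f:
    for src, tgt in pairs:
        f.write(f"{src} ||| {tgt}\n")

def read_parallel(path):
    """Read (source, target) tuples back from a |||-separated file."""
    with open(path, encoding="utf-8") as f:
        return [tuple(line.rstrip("\n").split(" ||| ")) for line in f]
```

Note that both sides should already be tokenized (punctuation split off as separate tokens), since alignments are produced over whitespace-separated tokens.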

Highlighted Details

  • Achieves state-of-the-art alignment error rates (AER) on multiple language pairs, outperforming statistical aligners like fast_align and Mgiza.
  • Offers fine-tuning capabilities for improved performance on specific parallel corpora.
  • Supports extraction of alignment probabilities and word pairs.
  • Can incorporate supervised gold alignments for enhanced training.
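Word-pair outputs and AER scores like those cited above are conventionally handled in the Pharaoh `i-j` format. A small sketch of parsing and scoring; the helper names are illustrative, and the AER formula follows the standard definition, assuming the possible set includes the sure set.

```python
def parse_pharaoh(line):
    """Parse one line of space-separated 'i-j' pairs into a set of tuples."""
    return {tuple(map(int, pair.split("-"))) for pair in line.split()}

def aer(sure, possible, hyp):
    """Alignment Error Rate: 1 - (|A∩S| + |A∩P|) / (|A| + |S|).

    Lower is better; 0.0 means the hypothesis exactly covers the sure set.
    """
    return 1.0 - (len(hyp & sure) + len(hyp & possible)) / (len(hyp) + len(sure))
```

For example, `parse_pharaoh("0-0 1-2 2-1")` yields the pairs `{(0, 0), (1, 2), (2, 1)}`, and scoring a hypothesis identical to the gold alignments gives an AER of 0.0.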

Maintenance & Community

The project comes from NeuLab, Carnegie Mellon University's neural language technologies group. Parts of the code are borrowed from HuggingFace Transformers (Apache 2.0).

Licensing & Compatibility

Code is licensed under Apache 2.0, allowing for commercial use and integration with closed-source projects.

Limitations & Caveats

The README does not specify hardware requirements beyond GPU recommendations, nor does it detail the expected setup time for fine-tuning. Performance claims are based on specific datasets and may vary.

Health Check

  • Last commit: 3 years ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 9 stars in the last 90 days

Explore Similar Projects

Starred by Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), Travis Fischer (Founder of Agentic), and 5 more.

setfit by huggingface — Top 0.2%, 3k stars

Few-shot learning framework for Sentence Transformers. Created 3 years ago, updated 3 months ago.