awesome-align  by neulab

Neural word aligner for multilingual BERT models

Created 5 years ago
358 stars

Top 78.0% on SourcePulse

Project Summary

Awesome-align provides a neural approach to word alignment using multilingual BERT (mBERT). It targets NLP researchers and practitioners needing to extract word alignments from parallel corpora, offering improved quality over traditional statistical methods and enabling fine-tuning for specific language pairs.

How It Works

Awesome-align uses mBERT's contextualized embeddings to infer word alignments. Its 'softmax' extraction method computes a similarity matrix between source and target token embeddings, normalizes it with a softmax in each direction, and keeps the word pairs that score highly in both directions. The tool also supports fine-tuning mBERT on parallel data with objectives such as masked language modeling (MLM), translation language modeling (TLM), and a self-training objective (SO) to improve alignment quality.
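The 'softmax' extraction idea can be sketched in a few lines. This is a minimal illustration, not the project's code: it assumes the token embeddings are already available as plain lists of floats, that similarity is a dot product, and that a pair is kept when its probability exceeds a threshold in both directions. The function names and the threshold value are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def extract_alignments(src_vecs, tgt_vecs, threshold=0.001):
    """Return the set of (source_index, target_index) pairs whose softmax
    probability exceeds `threshold` in both directions.

    Assumption: `src_vecs` and `tgt_vecs` are per-token embedding vectors
    (in practice they would come from mBERT; here they are toy inputs).
    """
    # Dot-product similarity between every source and target token.
    sim = [[sum(a * b for a, b in zip(s, t)) for t in tgt_vecs] for s in src_vecs]
    # Source-to-target distribution: softmax over each row.
    src_to_tgt = [softmax(row) for row in sim]
    # Target-to-source distribution: softmax over each column.
    tgt_to_src = [softmax(list(col)) for col in zip(*sim)]
    return {
        (i, j)
        for i in range(len(src_vecs))
        for j in range(len(tgt_vecs))
        if src_to_tgt[i][j] > threshold and tgt_to_src[j][i] > threshold
    }


# Toy example: source token 0 resembles target token 1, and vice versa.
toy_src = [[5.0, 0.0], [0.0, 5.0]]
toy_tgt = [[0.0, 5.0], [5.0, 0.0]]
toy_alignments = extract_alignments(toy_src, toy_tgt)
```

Requiring agreement in both directions trades some recall for precision, since a pair that is likely only from one side is dropped.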

Quick Start & Requirements

  • Install: run pip install -r requirements.txt, then python setup.py install.
  • Requires Python and PyTorch.
  • GPU with CUDA is recommended for performance.
  • Input data format: one sentence pair per line, with the tokenized source and target sentences separated by |||.
  • Official demo and examples are available.
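The input format described above can be produced and read back with a small helper. This is a sketch under assumptions: tokens are joined with single spaces, the separator is written with a space on each side, and the helper names are hypothetical rather than part of the project.

```python
SEPARATOR = "|||"


def format_pair(src_tokens, tgt_tokens):
    """Build one input line from two already-tokenized sentences.

    Assumption: tokens are space-joined and the separator is padded
    with one space on each side.
    """
    return " ".join(src_tokens) + " " + SEPARATOR + " " + " ".join(tgt_tokens)


def parse_pair(line):
    """Split one input line back into (source_tokens, target_tokens)."""
    src, sep, tgt = line.rstrip("\n").partition(SEPARATOR)
    if not sep:
        raise ValueError("line has no '|||' separator: %r" % line)
    return src.split(), tgt.split()


example_line = format_pair(["Das", "ist", "gut", "."], ["This", "is", "good", "."])
example_src, example_tgt = parse_pair(example_line)
```

Validating every line this way before a long run catches pairs with a missing separator early.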

Highlighted Details

  • Achieves state-of-the-art alignment error rates (AER) on multiple language pairs, outperforming statistical aligners like fast_align and Mgiza.
  • Offers fine-tuning capabilities for improved performance on specific parallel corpora.
  • Supports extraction of alignment probabilities and word pairs.
  • Can incorporate supervised gold alignments for enhanced training.
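For context on the AER figures mentioned above, alignment error rate is conventionally computed from predicted links A, sure gold links S, and possible gold links P as 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|). The sketch below applies that standard definition. It assumes alignments are written as whitespace-separated i-j index pairs (the common Pharaoh-style format); that output format and the helper names are assumptions here, not details taken from this summary.

```python
def parse_links(line):
    """Parse a line such as '0-0 1-2' into a set of (src, tgt) index pairs.

    Assumption: Pharaoh-style 'i-j' pairs separated by whitespace.
    """
    links = set()
    for item in line.split():
        src_index, tgt_index = item.split("-")
        links.add((int(src_index), int(tgt_index)))
    return links


def alignment_error_rate(predicted, sure, possible):
    """Standard AER: 1 - (|A & S| + |A & P|) / (|A| + |S|).

    By convention every sure link also counts as a possible link.
    """
    possible = possible | sure
    denominator = len(predicted) + len(sure)
    if denominator == 0:
        return 0.0  # nothing predicted and nothing expected
    matched = len(predicted & sure) + len(predicted & possible)
    return 1.0 - matched / denominator


# Toy example: two correct links and one wrong link, giving an AER of about 0.2.
predicted_links = parse_links("0-0 1-1 2-3")
sure_links = parse_links("0-0 1-1")
possible_links = parse_links("2-2")
toy_aer = alignment_error_rate(predicted_links, sure_links, possible_links)
```

Lower is better: a prediction that matches the sure links exactly scores 0.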

Maintenance & Community

The project is associated with Neulab and its authors. Code is partially borrowed from HuggingFace Transformers (Apache 2.0).

Licensing & Compatibility

Code is licensed under Apache 2.0, allowing for commercial use and integration with closed-source projects.

Limitations & Caveats

The README does not specify hardware requirements beyond GPU recommendations, nor does it detail the expected setup time for fine-tuning. Performance claims are based on specific datasets and may vary.

Health Check

  • Last commit: 3 years ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 1 star in the last 30 days
