PyTorch code for cross-lingual language model pretraining (XLM/XLM-R)
This repository provides the original PyTorch implementation of Cross-lingual Language Model Pretraining (XLM), a framework for training language models that perform well across many languages. It is aimed at NLP researchers and practitioners who want to use or build on state-of-the-art cross-lingual models for tasks such as machine translation and text classification. The library covers monolingual and cross-lingual pretraining, fine-tuning on downstream tasks, and both unsupervised and supervised machine translation.
How It Works
XLM supports various pretraining objectives, including Masked Language Model (MLM), Causal Language Model (CLM), and Translation Language Model (TLM). It utilizes Byte Pair Encoding (BPE) for subword tokenization and offers implementations for both monolingual (BERT-style) and cross-lingual models. The framework is designed for scalability, supporting multi-GPU and multi-node training, and incorporates Product-Key Memory (PKM) layers for enhanced model capacity.
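To make the difference between the objectives concrete, the sketch below masks random tokens in a single sentence (MLM) and in a concatenated parallel sentence pair (TLM), so that the model can attend to the translation when recovering masked words. This is a simplified, hypothetical illustration rather than the repository's actual data pipeline: the token IDs, the MASK_ID value, and the plain 15% masking (without BERT's 80/10/10 replacement split) are assumptions made for readability.

    import torch

    MASK_ID = 4        # hypothetical ID for the [MASK] symbol
    MASK_PROB = 0.15   # BERT-style masking rate; simplified (no 80/10/10 split)

    def mask_tokens(token_ids, mask_prob=MASK_PROB):
        """Randomly replace a fraction of tokens with MASK_ID and return
        (masked inputs, labels); unmasked positions get -100 so a cross-entropy
        loss with ignore_index=-100 skips them."""
        ids = token_ids.clone()
        mask = torch.rand(ids.shape) < mask_prob
        labels = torch.where(mask, ids, torch.full_like(ids, -100))
        ids[mask] = MASK_ID
        return ids, labels

    # MLM: mask a single monolingual sentence (made-up BPE token IDs).
    en = torch.tensor([101, 7, 42, 13, 99, 5])
    mlm_inputs, mlm_labels = mask_tokens(en)

    # TLM: concatenate a parallel sentence pair and mask both sides, so the
    # model can attend to the translation when predicting masked tokens.
    fr = torch.tensor([101, 8, 57, 21, 5])
    tlm_inputs, tlm_labels = mask_tokens(torch.cat([en, fr]))

CLM, by contrast, requires no masking at all: the model simply predicts each token from the tokens to its left.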
Quick Start & Requirements
From a clone of the repository, install it in editable mode:

    pip install -e .

The upstream README also lists Python 3, NumPy, PyTorch, fastBPE (to generate and apply BPE codes), and the Moses tokenization scripts as dependencies, with NVIDIA Apex required for fp16 training.
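As a quick sanity check after installation, you can download one of the pretrained checkpoints linked in the upstream README and inspect it with plain PyTorch. The snippet below is only a sketch: the file name mlm_enfr_1024.pth refers to one of the released English-French MLM models, and the exact structure of the checkpoint dictionary should be confirmed against the repository's own loading code.

    import torch

    # Path to a downloaded pretrained checkpoint (assumed file name from the
    # README's pretrained-model table; adjust to whatever you downloaded).
    checkpoint_path = "mlm_enfr_1024.pth"

    # Load on CPU so no GPU is required just to inspect the file.
    checkpoint = torch.load(checkpoint_path, map_location="cpu")

    # The released checkpoints bundle model weights together with dictionary
    # and hyperparameter metadata; list the top-level keys to see what ships.
    print(list(checkpoint.keys()))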
Maintenance & Community
This project is from Facebook AI Research (FAIR). The repository is marked inactive, with its last update roughly two years ago, and the README does not list community channels such as Discord or Slack.
Licensing & Compatibility
The repository includes a LICENSE file, but its terms are not summarized in the README. Do not assume commercial compatibility; review the LICENSE file directly before any commercial use.
Limitations & Caveats
The README notes that training is sensitive to optimizer hyperparameters and that larger batch sizes improve results, so careful tuning and substantial multi-GPU hardware should be expected. Some older data sources (e.g., the Toronto Book Corpus) are no longer hosted.