XLM by facebookresearch

PyTorch code for cross-lingual language model pretraining (XLM/XLM-R)

created 6 years ago
2,916 stars

Top 16.7% on sourcepulse

Project Summary

This repository provides the original PyTorch implementation for Cross-lingual Language Model Pretraining (XLM), a framework for training language models that excel across multiple languages. It targets researchers and practitioners in NLP looking to leverage or build upon state-of-the-art cross-lingual models for tasks like machine translation and text classification. The library enables monolingual and cross-lingual pretraining, fine-tuning on downstream tasks, and includes implementations for unsupervised and supervised machine translation.

How It Works

XLM supports various pretraining objectives, including Masked Language Model (MLM), Causal Language Model (CLM), and Translation Language Model (TLM). It utilizes Byte Pair Encoding (BPE) for subword tokenization and offers implementations for both monolingual (BERT-style) and cross-lingual models. The framework is designed for scalability, supporting multi-GPU and multi-node training, and incorporates Product-Key Memory (PKM) layers for enhanced model capacity.
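
To make the MLM objective concrete, here is a minimal PyTorch sketch of BERT-style token masking with the usual 15% selection rate and 80/10/10 mask/random/keep split; the function name, mask_index, and vocab_size are illustrative placeholders, not identifiers from this repository. TLM applies the same masking to concatenated parallel sentence pairs so the model can attend across languages.

```python
# Minimal sketch of BERT-style MLM masking (not the repo's actual code).
import torch

def mask_tokens(tokens, mask_index, vocab_size, mlm_prob=0.15):
    """tokens: LongTensor (batch, seq_len). Returns (inputs, targets)."""
    targets = tokens.clone()
    # Choose which positions the model must predict.
    pred_mask = torch.rand(tokens.shape) < mlm_prob
    targets[~pred_mask] = -100                     # ignored by the loss
    inputs = tokens.clone()
    rand = torch.rand(tokens.shape)
    # 80% of selected positions: replace with the [MASK] token.
    inputs[pred_mask & (rand < 0.8)] = mask_index
    # 10%: replace with a random token; remaining 10% keep the original.
    random_ids = torch.randint(vocab_size, tokens.shape)
    replace_random = pred_mask & (rand >= 0.8) & (rand < 0.9)
    inputs[replace_random] = random_ids[replace_random]
    return inputs, targets
```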

Quick Start & Requirements

  • Install via pip: pip install -e .
  • Dependencies: Python 3, NumPy, PyTorch (0.4 or 1.0+), fastBPE, Apex (for fp16).
  • Requires significant data preprocessing and computational resources for training.
  • Official demos and detailed setup instructions are available in the README; a minimal checkpoint sanity check is sketched after this list.
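
As a quick sanity check after installation, one can load a downloaded pretrained checkpoint on CPU and inspect what it bundles. The file name below is an assumption based on the pretrained models linked in the README; substitute whichever checkpoint you actually downloaded.

```python
# Hedged sanity check: load a pretrained checkpoint and list its contents.
# "mlm_tlm_xnli15_1024.pth" is assumed to be one of the README's checkpoints.
import torch

reloaded = torch.load("mlm_tlm_xnli15_1024.pth", map_location="cpu")
print(list(reloaded.keys()))  # typically model weights plus vocabulary/params metadata
```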

Highlighted Details

  • XLM-R model included, trained on 2.5 TB of data across 100 languages.
  • Achieves state-of-the-art performance on XNLI and GLUE benchmarks.
  • Supports unsupervised and supervised Neural Machine Translation (NMT).
  • Implements Product-Key Memory (PKM) layers for larger model capacity (see the sketch after this list).
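
For readers curious how a product-key lookup works, below is a minimal, self-contained sketch of the idea behind PKM layers, assuming a single query head, no query batch-norm, and illustrative sizes; it is not the repository's implementation, whose class names and hyperparameters differ.

```python
# Minimal sketch of a Product-Key Memory lookup (illustrative, not the repo's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProductKeyMemory(nn.Module):
    def __init__(self, dim=512, n_sub_keys=128, k=32):
        super().__init__()
        half = dim // 2
        self.k = k
        self.n_sub_keys = n_sub_keys
        # Two independent sets of sub-keys; their Cartesian product indexes
        # n_sub_keys ** 2 memory slots without materializing the full key set.
        self.sub_keys1 = nn.Parameter(torch.randn(n_sub_keys, half))
        self.sub_keys2 = nn.Parameter(torch.randn(n_sub_keys, half))
        self.values = nn.Embedding(n_sub_keys ** 2, dim)

    def forward(self, query):                      # query: (batch, dim)
        q1, q2 = query.chunk(2, dim=-1)            # split the query in halves
        s1 = q1 @ self.sub_keys1.t()               # (batch, n_sub_keys)
        s2 = q2 @ self.sub_keys2.t()
        # Top-k per half, then combine into k*k candidate slots.
        top1, idx1 = s1.topk(self.k, dim=-1)       # (batch, k)
        top2, idx2 = s2.topk(self.k, dim=-1)
        cand_scores = (top1.unsqueeze(-1) + top2.unsqueeze(-2)).flatten(1)
        cand_ids = (idx1.unsqueeze(-1) * self.n_sub_keys + idx2.unsqueeze(-2)).flatten(1)
        best_scores, best = cand_scores.topk(self.k, dim=-1)
        slot_ids = cand_ids.gather(1, best)        # (batch, k) memory slots
        weights = F.softmax(best_scores, dim=-1).unsqueeze(-1)
        return (self.values(slot_ids) * weights).sum(dim=1)   # (batch, dim)

# Usage: out = ProductKeyMemory()(torch.randn(4, 512))  -> shape (4, 512)
```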

Maintenance & Community

This project is from Facebook AI Research (FAIR). The README does not specify active maintenance or community channels like Discord/Slack.

Licensing & Compatibility

The repository includes a LICENSE file, but its specific terms are not summarized in the README. Users should review the license before assuming compatibility with commercial use.

Limitations & Caveats

The README mentions that training is sensitive to optimizer parameters and that larger batch sizes improve performance, implying a need for careful tuning and substantial hardware. Some older data sources (e.g., Toronto Book Corpus) are noted as no longer hosted.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 15 stars in the last 90 days

Explore Similar Projects

Starred by Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake), Abhishek Thakur (World's First 4x Kaggle GrandMaster), and 5 more.

xlnet by zihangdai

Language model research paper using generalized autoregressive pretraining
6k stars · created 6 years ago · updated 2 years ago