XLM by facebookresearch

PyTorch code for cross-lingual language model pretraining (XLM/XLM-R)

created 6 years ago
2,916 stars

Top 16.7% on sourcepulse

Project Summary

This repository provides the original PyTorch implementation for Cross-lingual Language Model Pretraining (XLM), a framework for training language models that excel across multiple languages. It targets researchers and practitioners in NLP looking to leverage or build upon state-of-the-art cross-lingual models for tasks like machine translation and text classification. The library enables monolingual and cross-lingual pretraining, fine-tuning on downstream tasks, and includes implementations for unsupervised and supervised machine translation.

How It Works

XLM supports various pretraining objectives, including Masked Language Model (MLM), Causal Language Model (CLM), and Translation Language Model (TLM). It utilizes Byte Pair Encoding (BPE) for subword tokenization and offers implementations for both monolingual (BERT-style) and cross-lingual models. The framework is designed for scalability, supporting multi-GPU and multi-node training, and incorporates Product-Key Memory (PKM) layers for enhanced model capacity.
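
To make the MLM objective concrete, here is a minimal PyTorch sketch of BERT-style token masking with the usual 15% selection rate and 80/10/10 mask/random/keep split; the function name, mask_index, and vocab_size are illustrative placeholders, not identifiers from this repository. TLM applies the same masking to concatenated parallel sentence pairs so the model can attend across languages.

```python
# Minimal sketch of BERT-style MLM masking (not the repo's actual code).
import torch

def mask_tokens(tokens, mask_index, vocab_size, mlm_prob=0.15):
    """tokens: LongTensor (batch, seq_len). Returns (inputs, targets)."""
    targets = tokens.clone()
    # Choose which positions the model must predict.
    pred_mask = torch.rand(tokens.shape) < mlm_prob
    targets[~pred_mask] = -100                     # ignored by the loss
    inputs = tokens.clone()
    rand = torch.rand(tokens.shape)
    # 80% of selected positions: replace with the [MASK] token.
    inputs[pred_mask & (rand < 0.8)] = mask_index
    # 10%: replace with a random token; remaining 10% keep the original.
    random_ids = torch.randint(vocab_size, tokens.shape)
    replace_random = pred_mask & (rand >= 0.8) & (rand < 0.9)
    inputs[replace_random] = random_ids[replace_random]
    return inputs, targets
```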

Quick Start & Requirements

  • Install via pip: pip install -e .
  • Dependencies: Python 3, NumPy, PyTorch (0.4 or 1.0+), fastBPE, Apex (for fp16).
  • Requires significant data preprocessing and computational resources for training.
  • Official demos and detailed setup instructions are available in the README; a minimal checkpoint sanity check is sketched after this list.
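
As a quick sanity check after installation, one can load a downloaded pretrained checkpoint on CPU and inspect what it bundles. The file name below is an assumption based on the pretrained models linked in the README; substitute whichever checkpoint you actually downloaded.

```python
# Hedged sanity check: load a pretrained checkpoint and list its contents.
# "mlm_tlm_xnli15_1024.pth" is assumed to be one of the README's checkpoints.
import torch

reloaded = torch.load("mlm_tlm_xnli15_1024.pth", map_location="cpu")
print(list(reloaded.keys()))  # typically model weights plus vocabulary/params metadata
```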

Highlighted Details

  • XLM-R model included, trained on 2.5 TB of data across 100 languages.
  • Achieves state-of-the-art performance on XNLI and GLUE benchmarks.
  • Supports unsupervised and supervised Neural Machine Translation (NMT).
  • Implements Product-Key Memory (PKM) layers for larger model capacity (see the sketch after this list).
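
For readers curious how a product-key lookup works, below is a minimal, self-contained sketch of the idea behind PKM layers, assuming a single query head, no query batch-norm, and illustrative sizes; it is not the repository's implementation, whose class names and hyperparameters differ.

```python
# Minimal sketch of a Product-Key Memory lookup (illustrative, not the repo's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProductKeyMemory(nn.Module):
    def __init__(self, dim=512, n_sub_keys=128, k=32):
        super().__init__()
        half = dim // 2
        self.k = k
        self.n_sub_keys = n_sub_keys
        # Two independent sets of sub-keys; their Cartesian product indexes
        # n_sub_keys ** 2 memory slots without materializing the full key set.
        self.sub_keys1 = nn.Parameter(torch.randn(n_sub_keys, half))
        self.sub_keys2 = nn.Parameter(torch.randn(n_sub_keys, half))
        self.values = nn.Embedding(n_sub_keys ** 2, dim)

    def forward(self, query):                      # query: (batch, dim)
        q1, q2 = query.chunk(2, dim=-1)            # split the query in halves
        s1 = q1 @ self.sub_keys1.t()               # (batch, n_sub_keys)
        s2 = q2 @ self.sub_keys2.t()
        # Top-k per half, then combine into k*k candidate slots.
        top1, idx1 = s1.topk(self.k, dim=-1)       # (batch, k)
        top2, idx2 = s2.topk(self.k, dim=-1)
        cand_scores = (top1.unsqueeze(-1) + top2.unsqueeze(-2)).flatten(1)
        cand_ids = (idx1.unsqueeze(-1) * self.n_sub_keys + idx2.unsqueeze(-2)).flatten(1)
        best_scores, best = cand_scores.topk(self.k, dim=-1)
        slot_ids = cand_ids.gather(1, best)        # (batch, k) memory slots
        weights = F.softmax(best_scores, dim=-1).unsqueeze(-1)
        return (self.values(slot_ids) * weights).sum(dim=1)   # (batch, dim)

# Usage: out = ProductKeyMemory()(torch.randn(4, 512))  -> shape (4, 512)
```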

Maintenance & Community

This project is from Facebook AI Research (FAIR). The README does not specify active maintenance or community channels like Discord/Slack.

Licensing & Compatibility

The repository includes a LICENSE file, but its specific terms are not summarized in the README. Users should review the license before assuming compatibility with commercial use.

Limitations & Caveats

The README mentions that training is sensitive to optimizer parameters and that larger batch sizes improve performance, implying a need for careful tuning and substantial hardware. Some older data sources (e.g., Toronto Book Corpus) are noted as no longer hosted.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 15 stars in the last 90 days

Explore Similar Projects

Starred by Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake), Abhishek Thakur (World's First 4x Kaggle GrandMaster), and 5 more.

xlnet by zihangdai

Language model research paper using generalized autoregressive pretraining
6k stars · created 6 years ago · updated 2 years ago