Code and data for crosslingual multitask finetuning research
This repository provides the components and methodology for creating BLOOMZ and mT0, large language models fine-tuned for cross-lingual generalization through multitask learning. It targets researchers and practitioners in NLP seeking to improve multilingual performance and understand the impact of diverse training data and prompting strategies.
How It Works
The core innovation is the xP3 dataset, a mixture of 13 tasks across 46 languages, with variants including English prompts (xP3), machine-translated prompts (xP3mt), and an extended-language-coverage version (xP3x). Models are fine-tuned on this mixture using the Megatron-DeepSpeed framework, which enables efficient distributed training of very large models. This multitask, multilingual approach demonstrably improves cross-lingual transfer over monolingual or narrowly multilingual training.
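For a concrete sense of the result, the released checkpoints can be queried with ordinary Hugging Face transformers code. A minimal inference sketch using bloomz-560m, the smallest released BLOOMZ checkpoint (larger sizes follow the same API but need far more memory):

```python
# Minimal inference sketch with Hugging Face transformers. The model was
# finetuned to follow natural-language instructions zero-shot, including
# prompts that cross languages (here: an English instruction, French input).
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigscience/bloomz-560m"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

inputs = tokenizer.encode("Translate to English: Je t'aime.", return_tensors="pt")
outputs = model.generate(inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```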
Quick Start & Requirements
- Data: requires the promptsource (specific branches for xP3/xP3x/xP3mt), datasets, and iso-639 Python packages. Data preparation involves running the provided scripts (e.g., prepare_xp3.py, create_xp3x.py); a hedged loading sketch follows this list.
- Training: requires a Megatron-DeepSpeed setup, including specific CUDA versions and potentially large storage for datasets and checkpoints. Training scripts are provided for SLURM environments.
- Evaluation: uses the promptsource and lm-evaluation-harness forks. Code generation evaluation requires additional setup.
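As a quicker alternative to rebuilding the mixture locally, the released xP3 data can be streamed from the Hugging Face Hub. A minimal sketch, assuming the dataset exposes per-language configurations named by ISO codes (e.g., "en") and records with inputs/targets fields; check the dataset card for the exact layout:

```python
# Hedged sketch: stream a slice of the released xP3 mixture instead of
# rebuilding it with prepare_xp3.py. The "en" config name is an
# assumption; consult the bigscience/xP3 dataset card for the layout.
from datasets import load_dataset

ds = load_dataset("bigscience/xP3", "en", streaming=True)
sample = next(iter(ds["train"]))
print(sample["inputs"])   # templated prompt text
print(sample["targets"])  # expected completion
```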
Highlighted Details
Maintenance & Community
The project is associated with the BigScience workshop and the BLOOM model. Links to community discussions and related projects (e.g., Project Aya) are available.
Licensing & Compatibility
The repository itself does not explicitly state a license. The released checkpoints carry their own terms: BLOOMZ models are distributed under the BigScience RAIL license, while mT0 models are distributed under Apache 2.0. Users should verify the license on each specific model checkpoint before commercial use.
Limitations & Caveats
Training and reproducing the models require significant computational resources and expertise in distributed training frameworks like Megatron-DeepSpeed. Some dataset versions and evaluation scripts may still be under development (WIP).