Code and data for crosslingual multitask finetuning research
This repository provides the components and methodology for creating BLOOMZ and mT0, large language models fine-tuned for cross-lingual generalization through multitask learning. It targets researchers and practitioners in NLP seeking to improve multilingual performance and understand the impact of diverse training data and prompting strategies.
How It Works
The core innovation is the xP3 dataset, a mixture of 13 tasks across 46 languages, with variants including English prompts (xP3), machine-translated prompts (xP3mt), and an extended-language-coverage version (xP3x). Models are fine-tuned on this mixture using the Megatron-DeepSpeed framework, which enables efficient distributed training of very large models. This multitask, multilingual approach demonstrably improves cross-lingual transfer over monolingual or narrowly multilingual training.
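For a concrete sense of the result, the released checkpoints can be queried with ordinary Hugging Face transformers code. A minimal inference sketch using bloomz-560m, the smallest released BLOOMZ checkpoint (larger sizes follow the same API but need far more memory):

```python
# Minimal inference sketch with Hugging Face transformers. The model was
# finetuned to follow natural-language instructions zero-shot, including
# prompts that cross languages (here: an English instruction, French input).
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigscience/bloomz-560m"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

inputs = tokenizer.encode("Translate to English: Je t'aime.", return_tensors="pt")
outputs = model.generate(inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```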
Quick Start & Requirements
- Data: requires the promptsource (specific branches for xP3/xP3x/xP3mt), datasets, and iso-639 Python packages. Data preparation involves running the provided scripts (e.g., prepare_xp3.py, create_xp3x.py); a hedged loading sketch follows this list.
- Training: requires a Megatron-DeepSpeed setup, including specific CUDA versions and potentially large storage for datasets and checkpoints. Training scripts are provided for SLURM environments.
- Evaluation: uses the promptsource and lm-evaluation-harness forks. Code generation evaluation requires additional setup.
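As a quicker alternative to rebuilding the mixture locally, the released xP3 data can be streamed from the Hugging Face Hub. A minimal sketch, assuming the dataset exposes per-language configurations named by ISO codes (e.g., "en") and records with inputs/targets fields; check the dataset card for the exact layout:

```python
# Hedged sketch: stream a slice of the released xP3 mixture instead of
# rebuilding it with prepare_xp3.py. The "en" config name is an
# assumption; consult the bigscience/xP3 dataset card for the layout.
from datasets import load_dataset

ds = load_dataset("bigscience/xP3", "en", streaming=True)
sample = next(iter(ds["train"]))
print(sample["inputs"])   # templated prompt text
print(sample["targets"])  # expected completion
```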
Highlighted Details
Maintenance & Community
The project is associated with the BigScience workshop and the BLOOM model. Links to community discussions and related projects (e.g., Project Aya) are available.
Licensing & Compatibility
The repository itself does not explicitly state a license. The released checkpoints carry their own terms: BLOOMZ models are distributed under the BigScience RAIL license, while mT0 models are distributed under Apache 2.0. Users should verify the license on each specific model checkpoint before commercial use.
Limitations & Caveats
Training and reproducing the models require significant computational resources and expertise in distributed training frameworks like Megatron-DeepSpeed. Some dataset versions and evaluation scripts may still be under development (WIP).