xtreme by google-research

Benchmark for cross-lingual generalization evaluation of multilingual models

created 5 years ago
645 stars

Top 52.6% on sourcepulse

View on GitHub
Project Summary

XTREME is a benchmark for evaluating the cross-lingual generalization capabilities of pre-trained multilingual language models. It targets researchers and practitioners in NLP who need to assess model performance across a wide range of languages and tasks, providing a standardized framework for comparing zero-shot cross-lingual transfer abilities.

How It Works

XTREME comprises nine diverse NLP tasks, including sentence classification, named entity recognition, and question answering, spanning 40 typologically diverse languages. The benchmark's core evaluation methodology is zero-shot cross-lingual transfer: models are fine-tuned on English data for each task and then evaluated on test data in other languages. This approach directly measures a model's ability to generalize learned representations across linguistic boundaries without task-specific multilingual fine-tuning.
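
A minimal sketch of this recipe, assuming mBERT as the multilingual encoder and XNLI (one of the benchmark's classification tasks) loaded through Hugging Face datasets, looks like the following; it is illustrative only, not XTREME's own training code:

```python
# Sketch of the zero-shot cross-lingual transfer recipe (not XTREME's own
# scripts). Assumptions: mBERT as the multilingual encoder and XNLI via the
# Hugging Face `datasets` library; any multilingual checkpoint and XTREME task
# would follow the same fine-tune-on-English, evaluate-everywhere pattern.
import numpy as np
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

def encode(batch):
    return tokenizer(batch["premise"], batch["hypothesis"],
                     truncation=True, max_length=128, padding="max_length")

def accuracy(eval_pred):
    logits, labels = eval_pred
    return {"accuracy": float((np.argmax(logits, axis=-1) == labels).mean())}

# Fine-tune on English training data only.
train_en = load_dataset("xnli", "en", split="train").map(encode, batched=True)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="xnli-en", num_train_epochs=2,
                           per_device_train_batch_size=32),
    train_dataset=train_en,
    compute_metrics=accuracy,
)
trainer.train()

# Zero-shot evaluation: the same English-tuned checkpoint is scored on test
# sets in other languages, with no further fine-tuning.
for lang in ["sw", "hi", "de"]:
    test = load_dataset("xnli", lang, split="test").map(encode, batched=True)
    print(lang, trainer.evaluate(test))
```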

Quick Start & Requirements

  • Install: Clone the repository and run bash install_tools.sh for dependencies (a combined setup sketch follows this list).
  • Data Download: Manually download panx_dataset to a download folder, then run bash scripts/download_data.sh.
  • Prerequisites: Python 3.7+, Anaconda, transformers, seqeval, tensorboardx, jieba, kytea, pythainlp, sacremoses.
  • Resources: Requires downloading substantial datasets.
  • More Info: XTREME Website
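
For readers who prefer to drive those steps from Python, here is a hypothetical wrapper around the README's commands; the repository URL and the exact download/panx_dataset location are assumptions based on the description above.

```python
# Hypothetical wrapper around the quick-start steps; the repo URL and the
# download/panx_dataset location are assumptions based on the README text.
import subprocess
from pathlib import Path

repo = Path("xtreme")
if not repo.exists():
    subprocess.run(
        ["git", "clone", "https://github.com/google-research/xtreme.git", str(repo)],
        check=True,
    )

# Install task-specific tools and Python dependencies.
subprocess.run(["bash", "install_tools.sh"], cwd=repo, check=True)

# panx_dataset must be obtained manually and placed in the download folder
# before the remaining datasets are fetched.
panx = repo / "download" / "panx_dataset"
if not panx.exists():
    raise SystemExit(f"Place panx_dataset under {panx} manually, then re-run.")

# Download and preprocess the remaining datasets.
subprocess.run(["bash", "scripts/download_data.sh"], cwd=repo, check=True)
```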

Highlighted Details

  • Evaluates cross-lingual generalization across 40 languages and 9 NLP tasks.
  • Focuses on zero-shot transfer from English fine-tuning.
  • Includes under-studied languages like Swahili, Yoruba, Tamil, Telugu, and Malayalam.
  • Provides baseline implementations and evaluation scripts.

Maintenance & Community

This project is from Google Research. Further community engagement details are not explicitly provided in the README.

Licensing & Compatibility

The repository itself is not explicitly licensed, but it references and utilizes datasets with their own licenses. Users should verify compatibility for commercial use or closed-source linking based on the individual dataset licenses.

Limitations & Caveats

The README notes that automatically translated test sets are "noisy and should not be treated as ground truth." The benchmark's focus is specifically on zero-shot transfer from English, which may not cover all desired cross-lingual evaluation scenarios.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 3 stars in the last 90 days

Explore Similar Projects

Starred by Aravind Srinivas (Cofounder of Perplexity), Jiayi Pan (Author of SWE-Gym; AI Researcher at UC Berkeley), and 8 more.

gpt-3 by openai

0.0%
16k
Research paper on large language model few-shot learning
created 5 years ago
updated 4 years ago