xtreme by google-research

Benchmark for cross-lingual generalization evaluation of multilingual models

created 5 years ago
645 stars

Top 52.6% on sourcepulse

View on GitHub
Project Summary

XTREME is a benchmark for evaluating the cross-lingual generalization capabilities of pre-trained multilingual language models. It targets researchers and practitioners in NLP who need to assess model performance across a wide range of languages and tasks, providing a standardized framework for comparing zero-shot cross-lingual transfer abilities.

How It Works

XTREME comprises nine diverse NLP tasks, including sentence classification, named entity recognition, and question answering, spanning 40 typologically diverse languages. The benchmark's core evaluation methodology is zero-shot cross-lingual transfer: models are fine-tuned on English data for each task and then evaluated on test data in other languages. This approach directly measures a model's ability to generalize learned representations across linguistic boundaries without task-specific multilingual fine-tuning.
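
A minimal sketch of this recipe, assuming mBERT as the multilingual encoder and XNLI (one of the benchmark's classification tasks) loaded through Hugging Face datasets, looks like the following; it is illustrative only, not XTREME's own training code:

```python
# Sketch of the zero-shot cross-lingual transfer recipe (not XTREME's own
# scripts). Assumptions: mBERT as the multilingual encoder and XNLI via the
# Hugging Face `datasets` library; any multilingual checkpoint and XTREME task
# would follow the same fine-tune-on-English, evaluate-everywhere pattern.
import numpy as np
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

def encode(batch):
    return tokenizer(batch["premise"], batch["hypothesis"],
                     truncation=True, max_length=128, padding="max_length")

def accuracy(eval_pred):
    logits, labels = eval_pred
    return {"accuracy": float((np.argmax(logits, axis=-1) == labels).mean())}

# Fine-tune on English training data only.
train_en = load_dataset("xnli", "en", split="train").map(encode, batched=True)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="xnli-en", num_train_epochs=2,
                           per_device_train_batch_size=32),
    train_dataset=train_en,
    compute_metrics=accuracy,
)
trainer.train()

# Zero-shot evaluation: the same English-tuned checkpoint is scored on test
# sets in other languages, with no further fine-tuning.
for lang in ["sw", "hi", "de"]:
    test = load_dataset("xnli", lang, split="test").map(encode, batched=True)
    print(lang, trainer.evaluate(test))
```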

Quick Start & Requirements

  • Install: Clone the repository and run bash install_tools.sh for dependencies (a combined setup sketch follows this list).
  • Data Download: Manually download panx_dataset to a download folder, then run bash scripts/download_data.sh.
  • Prerequisites: Python 3.7+, Anaconda, transformers, seqeval, tensorboardx, jieba, kytea, pythainlp, sacremoses.
  • Resources: Requires downloading substantial datasets.
  • More Info: XTREME Website
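
For readers who prefer to drive those steps from Python, here is a hypothetical wrapper around the README's commands; the repository URL and the exact download/panx_dataset location are assumptions based on the description above.

```python
# Hypothetical wrapper around the quick-start steps; the repo URL and the
# download/panx_dataset location are assumptions based on the README text.
import subprocess
from pathlib import Path

repo = Path("xtreme")
if not repo.exists():
    subprocess.run(
        ["git", "clone", "https://github.com/google-research/xtreme.git", str(repo)],
        check=True,
    )

# Install task-specific tools and Python dependencies.
subprocess.run(["bash", "install_tools.sh"], cwd=repo, check=True)

# panx_dataset must be obtained manually and placed in the download folder
# before the remaining datasets are fetched.
panx = repo / "download" / "panx_dataset"
if not panx.exists():
    raise SystemExit(f"Place panx_dataset under {panx} manually, then re-run.")

# Download and preprocess the remaining datasets.
subprocess.run(["bash", "scripts/download_data.sh"], cwd=repo, check=True)
```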

Highlighted Details

  • Evaluates cross-lingual generalization across 40 languages and 9 NLP tasks.
  • Focuses on zero-shot transfer from English fine-tuning.
  • Includes under-studied languages like Swahili, Yoruba, Tamil, Telugu, and Malayalam.
  • Provides baseline implementations and evaluation scripts.

Maintenance & Community

This project is from Google Research. Further community engagement details are not explicitly provided in the README.

Licensing & Compatibility

The repository itself is not explicitly licensed, but it references and utilizes datasets with their own licenses. Users should verify compatibility for commercial use or closed-source linking based on the individual dataset licenses.

Limitations & Caveats

The README notes that automatically translated test sets are "noisy and should not be treated as ground truth." The benchmark's focus is specifically on zero-shot transfer from English, which may not cover all desired cross-lingual evaluation scenarios.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 3 stars in the last 90 days

Explore Similar Projects

Starred by Aravind Srinivas (Cofounder of Perplexity), Jiayi Pan (Author of SWE-Gym; AI Researcher at UC Berkeley), and 8 more.

gpt-3 by openai

0.0%
16k
Research paper on large language model few-shot learning
created 5 years ago
updated 4 years ago