JGLUE by yahoojapan

Benchmark for Japanese NLU

created 3 years ago
320 stars

Top 86.0% on sourcepulse

Project Summary

JGLUE is a comprehensive benchmark for evaluating Japanese Natural Language Understanding (NLU) capabilities, designed to foster NLU research in the Japanese language domain. It spans three task categories (text classification, sentence pair classification, and question answering), each with one or more datasets, making it suitable for researchers and developers working with Japanese NLP models.

How It Works

JGLUE was constructed from scratch, avoiding translation from English benchmarks to ensure linguistic authenticity. It utilizes Yahoo! Crowdsourcing for data annotation and includes datasets like MARC-ja (text classification), JSTS (semantic textual similarity), JNLI (natural language inference), JSQuAD (reading comprehension), and JCommonsenseQA (commonsense reasoning). The benchmark provides detailed dataset descriptions and baseline performance scores using various Japanese BERT and RoBERTa models.
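
To make the task formats concrete, the sketch below shows roughly what records from three of these datasets look like. The field names, label values, and example sentences are illustrative assumptions and may not match the released JSON files exactly.

```python
# Illustrative (assumed) record shapes for three JGLUE tasks; field names
# and example content are a sketch, not copied from the released data.
jsts_example = {
    "sentence1": "川の側に白い犬がいます。",   # first sentence
    "sentence2": "白い犬が川辺にいる。",       # second sentence
    "label": 4.2,                              # similarity score in [0, 5]
}
jnli_example = {
    "sentence1": "男性がギターを弾いている。",  # premise
    "sentence2": "人が楽器を演奏している。",    # hypothesis
    "label": "entailment",                     # entailment / contradiction / neutral
}
jcommonsenseqa_example = {
    "question": "会社の最高責任者を何と呼ぶ?",
    "choice0": "社長", "choice1": "部長", "choice2": "課長",
    "choice3": "新人", "choice4": "アルバイト",
    "label": 0,                                # index of the correct choice
}
```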

Quick Start & Requirements

  • Dataset Preparation: Requires downloading original datasets (e.g., MARC, MS COCO Caption Dataset, SQuAD, CommonsenseQA) and running provided Python scripts for conversion and preprocessing. Specific morphological analyzers (MeCab, Juman++) are needed for certain models.
  • Fine-tuning: The fine-tuning process uses the Hugging Face transformers library. Detailed instructions are available in fine-tuning/README.md; a minimal sketch follows this list.
  • Resources: Preprocessing MARC-ja involves Python dependencies listed in preprocess/requirements.txt. Fine-tuning requires the computational resources typical for pre-trained transformer models.
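
As referenced above, here is a minimal fine-tuning sketch using the Hugging Face transformers Trainer. The model name, file paths, field names, label strings, and hyperparameters are assumptions for illustration, not the repo's exact configuration; see fine-tuning/README.md for the official recipe.

```python
# Hedged sketch: fine-tuning a Japanese BERT on a JNLI-style sentence-pair
# task. Paths and hyperparameters are illustrative assumptions.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# An assumed baseline model; it requires MeCab (fugashi + ipadic) to tokenize.
MODEL = "cl-tohoku/bert-base-japanese-whole-word-masking"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=3)

# Assumed locations of the preprocessed JNLI JSON files.
data = load_dataset(
    "json",
    data_files={"train": "jnli/train.json", "validation": "jnli/valid.json"},
)

label2id = {"entailment": 0, "contradiction": 1, "neutral": 2}

def encode(batch):
    # JNLI is a sentence-pair task: encode both sentences together.
    enc = tokenizer(batch["sentence1"], batch["sentence2"],
                    truncation=True, max_length=128)
    enc["labels"] = [label2id[label] for label in batch["label"]]
    return enc

data = data.map(encode, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out",
                           per_device_train_batch_size=32,
                           num_train_epochs=4),
    train_dataset=data["train"],
    eval_dataset=data["validation"],
    tokenizer=tokenizer,  # enables dynamic padding via DataCollatorWithPadding
)
trainer.train()
```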

Highlighted Details

  • Native Japanese Benchmark: Built entirely from Japanese data, unlike translated benchmarks.
  • Diverse Tasks: Covers text classification, sentence pair classification, and QA with multiple datasets.
  • Comprehensive Baselines: Includes performance scores for various Japanese BERT and RoBERTa models on all tasks; a note on the reported metrics follows this list.
  • Data Quality: MARC-ja dataset quality was enhanced via crowdsourced judgments.
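
As a rough guide to how those baseline numbers are computed, the snippet below sketches the per-task metrics reported in the JGLUE paper: accuracy for MARC-ja, JNLI, and JCommonsenseQA; Pearson/Spearman correlation for JSTS; and exact match / F1 for JSQuAD. The values are dummy data, and this is not the repo's evaluation code.

```python
# Dummy-data sketch of JGLUE-style metrics (not the repo's evaluation code).
from scipy.stats import pearsonr, spearmanr

# JSTS: correlation between gold and predicted similarity scores.
gold_scores = [4.2, 1.0, 3.5, 2.8, 0.5]
pred_scores = [4.0, 1.3, 3.1, 3.0, 0.7]
print("Pearson: ", pearsonr(gold_scores, pred_scores)[0])
print("Spearman:", spearmanr(gold_scores, pred_scores)[0])

# MARC-ja / JNLI / JCommonsenseQA: plain accuracy.
gold_labels = [0, 1, 2, 1]
pred_labels = [0, 1, 1, 1]
accuracy = sum(g == p for g, p in zip(gold_labels, pred_labels)) / len(gold_labels)
print("Accuracy:", accuracy)
```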

Maintenance & Community

Developed through a joint research project between Yahoo Japan Corporation and the Kawahara Lab at Waseda University. A leaderboard was planned, but the test set has since been released publicly.

Licensing & Compatibility

  • License: Creative Commons Attribution-ShareAlike 4.0 International License.
  • Contributor License Agreement (CLA): Required for contributors; contributing via GitHub implies agreement. CC BY-SA 4.0 permits commercial use, but the ShareAlike clause requires derivative datasets to be distributed under the same license, which can constrain closed-source redistribution of derived data.

Limitations & Caveats

  • The MARC-ja dataset is no longer directly distributed due to the discontinuation of the original MARC dataset by Amazon. Users must obtain the original data and run conversion scripts.
  • XLM-RoBERTa models show poor performance on JSQuAD due to tokenization mismatches; a sketch of the failure mode follows this list.
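
To illustrate that failure mode, here is a small self-contained sketch. The subword tokens shown are made up, not actual XLM-RoBERTa output: when a SentencePiece token straddles the gold answer boundary, extractive QA cannot recover the answer as a contiguous token span.

```python
# Illustrative only: hypothetical subword tokens, not real XLM-RoBERTa output.
# If a token straddles the answer boundary, no token span reproduces the answer.
tokens = ["▁富士", "山は", "日本", "で", "一番", "高い", "山", "です", "。"]
answer = "富士山"

def recoverable(tokens, answer):
    """Check whether `answer` equals the text of some contiguous token span."""
    surface = [t.lstrip("▁") for t in tokens]
    for i in range(len(surface)):
        for j in range(i, len(surface)):
            if "".join(surface[i:j + 1]) == answer:
                return True
    return False

# "富士山" ends inside the token "山は", so no token span matches it.
print(recoverable(tokens, answer))  # -> False
```
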
Health Check

  • Last commit: 4 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 4 stars in the last 90 days
