Chinese-XLNet by ymcui

Chinese XLNet pre-trained models for NLP tasks

Created 6 years ago · 1,650 stars · Top 26.1% on sourcepulse

Project Summary

This repository provides pre-trained XLNet models for Chinese natural language processing, aiming to enrich the Chinese NLP ecosystem with diverse model options. It is targeted at researchers and practitioners in Chinese NLP who need robust language models for various downstream tasks.

How It Works

The project offers two Chinese XLNet models: XLNet-mid (24 layers, 768 hidden size, 12 heads, 209M parameters) and XLNet-base (12 layers, 768 hidden size, 12 heads, 117M parameters). These models are trained on a large corpus of Chinese data (5.4B tokens), including Wikipedia and general domain data. The training process follows the official XLNet methodology, utilizing SentencePiece for tokenization and generating TFRecords for training.
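
As a sanity check on the figures above, the published checkpoint's configuration can be inspected without downloading the weights. A minimal sketch, assuming the Huggingface hub name used in the Quick Start below and the standard XLNetConfig attribute names:

    from transformers import AutoConfig

    # Fetch only the model configuration from the Huggingface hub.
    config = AutoConfig.from_pretrained("hfl/chinese-xlnet-mid")
    # Expected to report 24 layers, hidden size 768, and 12 heads.
    print(config.n_layer, config.d_model, config.n_head)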

Quick Start & Requirements

  • Installation: Models can be loaded directly via the Huggingface Transformers library.
    from transformers import AutoTokenizer, AutoModel

    # Both the tokenizer and the weights are pulled from the "hfl" hub namespace.
    tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-xlnet-mid")
    model = AutoModel.from_pretrained("hfl/chinese-xlnet-mid")

  • Prerequisites: Python, Huggingface Transformers (version 2.2.2 or later).
  • Model Downloads: Pre-trained TensorFlow weights are available via Google Drive and Baidu Netdisk; PyTorch weights can be downloaded from the Huggingface hub or converted from the TensorFlow checkpoints.
  • Resources: XLNet-mid model files are approximately 800MB.
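
Once loaded, a short smoke test confirms the checkpoint runs end to end. A minimal sketch, assuming a recent Transformers version (4.x, where the forward pass returns an output object) and the sentencepiece package installed for XLNet tokenization:

    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-xlnet-mid")
    model = AutoModel.from_pretrained("hfl/chinese-xlnet-mid")

    # Encode a short Chinese sentence and extract the final hidden states.
    inputs = tokenizer("哈工大讯飞联合实验室发布中文XLNet模型", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # One 768-dimensional vector per token: (batch, sequence_length, 768)
    print(outputs.last_hidden_state.shape)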

Highlighted Details

  • Achieves competitive results on Chinese NLP benchmarks like CMRC 2018 (Reading Comprehension) and DRCD (Traditional Chinese Reading Comprehension), outperforming BERT variants in some cases.
  • Provides detailed pre-training and fine-tuning configurations, including commands for data preparation, training, and task-specific fine-tuning on CMRC 2018, DRCD, and ChnSentiCorp (a Transformers-based sketch follows this list).
  • The project is based on the official CMU/Google XLNet implementation.
  • A technical report detailing the models and their performance is available on arXiv.
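
The repository's own fine-tuning commands drive the official TensorFlow codebase. For readers working on the Huggingface side instead, here is a hypothetical sentence-classification sketch in the style of ChnSentiCorp (binary sentiment); the example sentences and settings are illustrative, not taken from the repository:

    from transformers import AutoTokenizer, XLNetForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-xlnet-base")
    # ChnSentiCorp is binary sentiment analysis, hence num_labels=2.
    model = XLNetForSequenceClassification.from_pretrained(
        "hfl/chinese-xlnet-base", num_labels=2
    )

    # Tokenize a toy batch; real fine-tuning would iterate over the dataset
    # with an optimizer or the Trainer API.
    batch = tokenizer(["酒店环境很好", "房间又小又吵"], padding=True, return_tensors="pt")
    outputs = model(**batch)
    print(outputs.logits.shape)  # torch.Size([2, 2])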

Maintenance & Community

  • Developed by the Joint Laboratory of HIT (Harbin Institute of Technology) and iFLYTEK Research (HFL).
  • The project is supported by the Google TensorFlow Research Cloud (TFRC) program.
  • Issues and contributions can be submitted via GitHub Issues and Pull Requests.

Licensing & Compatibility

  • The pre-trained weights are released for technical research reference; any use must stay within the repository's license terms.
  • The project is a third-party effort, not an official product of the XLNet authors or iFlytek.

Limitations & Caveats

  • The pre-training dataset is not publicly available due to copyright issues.
  • Larger models are not guaranteed; they will be released only if they deliver significant performance improvements.
  • Users experiencing poor performance on specific datasets are advised to continue pre-training on their own data or use alternative models.
Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 3 stars in the last 90 days
