Chinese-ELECTRA by ymcui

Chinese ELECTRA pre-trained language models

Created 5 years ago · 1,425 stars · Top 29.2% on sourcepulse

Project Summary

This repository provides pre-trained ELECTRA models for the Chinese language, offering a more efficient alternative to BERT for various NLP tasks. It targets researchers and developers working with Chinese NLP, enabling them to leverage ELECTRA's smaller model size and strong performance.

How It Works

ELECTRA uses a novel pre-training approach: a small generator network replaces tokens in the input text, and a larger discriminator network is trained to detect those replacements. This "Replaced Token Detection" (RTD) task is more sample-efficient than BERT's Masked Language Modeling (MLM) because the loss is defined over every input token rather than only the small masked subset, yielding strong performance with fewer computational resources. The project focuses solely on the discriminator for downstream fine-tuning.
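
A minimal sketch of the RTD objective in action, using the discriminator through Hugging Face Transformers. The model ID is an assumption based on the repository's Hugging Face naming scheme, and the example sentence is purely illustrative:

    import torch
    from transformers import AutoTokenizer, ElectraForPreTraining

    # Model ID assumed from the repository's Hugging Face naming; verify before use.
    MODEL_NAME = "hfl/chinese-electra-base-discriminator"
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = ElectraForPreTraining.from_pretrained(MODEL_NAME)

    # The discriminator emits one logit per token; positive values flag tokens
    # it believes were replaced by the generator.
    inputs = tokenizer("我喜欢自然语言处理", return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits

    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    for token, score in zip(tokens, logits.squeeze().tolist()):
        print(token, "replaced" if score > 0 else "original")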

Quick Start & Requirements

  • Installation: Models can be loaded via Hugging Face Transformers (AutoTokenizer.from_pretrained(MODEL_NAME)) or PaddleHub (hub.Module(name=MODULE_NAME)), as shown in the sketch after this list.
  • Dependencies: Python, plus either Hugging Face Transformers (v2.8.0+) or PaddleHub. TensorFlow checkpoints are provided, along with a script for converting them to PyTorch.
  • Resources: Model weights range from about 46 MB (small) to roughly 1 GB (large).
  • Links: Hugging Face Models, PaddleHub
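
A hedged quick-start sketch for the Transformers route named above; MODEL_NAME is a placeholder for one of the repository's published discriminator checkpoints:

    from transformers import AutoModel, AutoTokenizer

    # Placeholder ID; substitute a discriminator checkpoint from the
    # repository's Hugging Face model list.
    MODEL_NAME = "hfl/chinese-electra-base-discriminator"

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModel.from_pretrained(MODEL_NAME)

    # Encode a sentence and read off the token-level hidden states.
    inputs = tokenizer("使用示例", return_tensors="pt")
    hidden_states = model(**inputs).last_hidden_state  # (1, seq_len, hidden_size)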

Highlighted Details

  • Offers multiple model sizes (small, base, large) and variants (original data, enlarged 180 GB corpus, legal domain).
  • Provides extensive benchmark results across reading comprehension, natural language inference, sentiment analysis, and sentence-pair matching tasks.
  • Includes detailed fine-tuning instructions and hyperparameter examples for common NLP tasks; see the sketch after this list.
  • Covers both Simplified and Traditional Chinese, with benchmarks on datasets such as CMRC 2018 (Simplified) and DRCD (Traditional).
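
To complement the fine-tuning notes above, here is a hedged sketch of fine-tuning the discriminator for binary sentence classification (e.g., sentiment analysis). The model ID, inline dataset, and hyperparameters are all illustrative, not the repository's official recipe:

    from datasets import Dataset
    from transformers import (
        AutoTokenizer,
        ElectraForSequenceClassification,
        Trainer,
        TrainingArguments,
    )

    MODEL_NAME = "hfl/chinese-electra-base-discriminator"  # assumed model ID

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = ElectraForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

    # Tiny inline dataset purely for illustration; swap in a real corpus.
    raw = Dataset.from_dict({"text": ["服务很好", "质量太差"], "labels": [1, 0]})
    encoded = raw.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=128))

    args = TrainingArguments(
        output_dir="electra-finetune",
        num_train_epochs=3,
        per_device_train_batch_size=16,
        learning_rate=3e-5,  # illustrative; the repo documents per-task values
    )
    Trainer(model=model, args=args, train_dataset=encoded, tokenizer=tokenizer).train()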

Maintenance & Community

The project is from the Harbin Institute of Technology (HIT) & iFlytek Joint Lab (HFL). Recent activity includes the release of Chinese LLaMA/Alpaca models. Users are encouraged to check the FAQ before submitting issues.

Licensing & Compatibility

The repository does not explicitly state a license. The models are available for download via Google Drive and Baidu Netdisk. Compatibility with commercial or closed-source projects is not specified.

Limitations & Caveats

  • Primary model weights are provided in TensorFlow format and must be converted for PyTorch use; a conversion sketch follows this list.
  • The project states that pre-training data is not shared.
  • The FAQ notes that requests for new features may not be accommodated.
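
For the TensorFlow-to-PyTorch step flagged in the first bullet, the following is a sketch using the converter utilities bundled with Transformers; the file paths are placeholders, a TensorFlow installation is required, and the function's availability and signature should be verified against your installed Transformers version:

    from transformers import (
        ElectraConfig,
        ElectraForPreTraining,
        load_tf_weights_in_electra,
    )

    # Placeholder paths pointing at an unpacked TF checkpoint from the repo.
    config = ElectraConfig.from_json_file("chinese-electra-base/config.json")
    model = ElectraForPreTraining(config)

    # Copy the TF variables into the PyTorch discriminator, then save it in
    # the standard Transformers layout for later from_pretrained() loading.
    load_tf_weights_in_electra(
        model,
        config,
        "chinese-electra-base/electra_base",  # TF checkpoint prefix (placeholder)
        discriminator_or_generator="discriminator",
    )
    model.save_pretrained("chinese-electra-base-pytorch")
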
Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 8 stars in the last 90 days
