This repository provides a pipeline-based solution for entity and relation extraction, specifically tailored for schema-constrained knowledge extraction tasks. It is designed for researchers and practitioners working with Chinese text data, offering a practical implementation based on TensorFlow and BERT for the 2019 Language and Intelligence Technology Competition.
How It Works
The system employs a two-stage pipeline. First, a multi-label classification model identifies potential relationship types within a sentence. Subsequently, a sequence labeling model, taking the sentence and predicted relationship types as input, identifies and labels the entities (subject and object) corresponding to those relationships. This approach allows for a structured extraction of (Subject, Predicate, Object) triples that adhere to predefined schemas.
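The two-stage flow described above can be sketched in plain Python. This is a minimal illustration, not the repository's implementation: the function names, the 0.5 threshold, and the B-SUB/I-SUB/B-OBJ/I-OBJ tag scheme are assumptions, and the two models are passed in as callables so that stubs can stand in for the trained BERT models.

```python
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def select_relations(probs: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Stage 1 (multi-label): keep every relation whose score passes the threshold."""
    return [rel for rel, p in probs.items() if p >= threshold]


def decode_spans(tokens: List[str], tags: List[str]) -> Dict[str, List[str]]:
    """Collect B-/I- tagged spans, grouped by role (assumed roles: SUB, OBJ)."""
    spans: Dict[str, List[str]] = {}
    role, buf = None, []
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if role is not None:  # close the span that was open
                spans.setdefault(role, []).append("".join(buf))
            role, buf = tag[2:], [tok]
        elif tag.startswith("I-") and role == tag[2:]:
            buf.append(tok)  # continue the current span
        else:
            if role is not None:
                spans.setdefault(role, []).append("".join(buf))
            role, buf = None, []
    if role is not None:  # flush a span that runs to the end of the sentence
        spans.setdefault(role, []).append("".join(buf))
    return spans


def extract_triples(
    tokens: List[str],
    classify: Callable[[List[str]], Dict[str, float]],
    label: Callable[[List[str], str], List[str]],
    threshold: float = 0.5,
) -> List[Triple]:
    """Stage 2: label the sentence once per predicted relation and pair up entities."""
    triples: List[Triple] = []
    for rel in select_relations(classify(tokens), threshold):
        spans = decode_spans(tokens, label(tokens, rel))
        for subj in spans.get("SUB", []):
            for obj in spans.get("OBJ", []):
                triples.append((subj, rel, obj))
    return triples


# Hypothetical stand-ins for the two trained models.
def toy_classify(tokens: List[str]) -> Dict[str, float]:
    return {"出生地": 0.93, "作者": 0.04}


def toy_label(tokens: List[str], relation: str) -> List[str]:
    return ["B-SUB", "I-SUB", "O", "O", "O", "B-OBJ", "I-OBJ"]


tokens = list("鲁迅出生于绍兴")
print(extract_triples(tokens, toy_classify, toy_label))
```

Because the labeling model is conditioned on one relation at a time, a sentence expressing several relations is simply labeled several times, once per relation that survives the first stage.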
Quick Start & Requirements
- Install: Python 3.6+, TensorFlow 1.12.0+. Download a Chinese BERT pre-trained model and place it in the pretrained_model directory. Download the competition data and place it in ./raw_data/.
- Data: Requires the training, development, and schema files from the 2019 Language and Intelligence Technology Competition. The official download links are no longer active; contact the email address given in the repository for assistance.
- Training: Separate commands are provided for training the relation classification model (run_predicate_classification.py) and the sequence labeling model (run_sequnce_labeling.py).
- Prediction: Commands for inference using trained models are also available.
- Resources: Requires a Chinese BERT model checkpoint.
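Before training, it can help to confirm that the two inputs listed above are in place. The helper below is a hypothetical pre-flight check, not part of the repository; it only assumes the pretrained_model and raw_data directory names mentioned in the install step and reports any that are missing or empty.

```python
from pathlib import Path
from typing import List, Sequence


def missing_inputs(
    root: str = ".",
    required: Sequence[str] = ("pretrained_model", "raw_data"),
) -> List[str]:
    """Return the required directories that are absent or empty under root."""
    missing = []
    for name in required:
        path = Path(root) / name
        # A directory counts as missing if it does not exist or holds no entries.
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(name)
    return missing


if __name__ == "__main__":
    absent = missing_inputs()
    if absent:
        print("Missing or empty:", ", ".join(absent))
    else:
        print("All required directories are populated.")
```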
Highlighted Details
- Achieved 87.1% F1 score on the test set in a competition setting.
- Implements a pipeline approach combining relation classification and sequence labeling.
- Utilizes a large-scale Chinese dataset (SKE) with over 430,000 triples and 210,000 sentences.
- Provides detailed training and prediction scripts for both components.
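For local checks on the development set, an F1 score over triples can be computed along the following lines. This sketch assumes exact-match scoring over sets of (subject, predicate, object) tuples; the competition's official scorer may apply different matching rules, so figures from this function are not guaranteed to be comparable with the reported score.

```python
from typing import Iterable, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def triple_prf(
    predicted: Iterable[Triple], gold: Iterable[Triple]
) -> Tuple[float, float, float]:
    """Precision, recall and F1 under exact match of whole triples."""
    pred_set, gold_set = set(predicted), set(gold)
    correct = len(pred_set & gold_set)
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

A triple counts as correct only when subject, predicate, and object all match, so a correct relation attached to a wrong entity span lowers both precision and recall.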
Maintenance & Community
- The project is associated with the 2019 Language and Intelligence Technology Competition.
- Contact information (email) is provided for data-related inquiries.
- Links to competition forums and related reports are included.
Licensing & Compatibility
- The repository does not explicitly state a license.
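- TensorFlow 1.12.0+ is required. TensorFlow itself is distributed under the Apache 2.0 license, which permits commercial use; that does not settle the licensing of this repository's own code.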
Limitations & Caveats
- Official data download links are no longer active, potentially hindering setup.
- The provided test data lacks labels, necessitating submission to official evaluation platforms for validation.
- The project is based on TensorFlow 1.x, a legacy release line that is no longer actively developed.