BERT-Relation-Extraction by plkmo

PyTorch scripts for relation extraction, based on BERT

created 5 years ago
598 stars

Top 55.3% on sourcepulse

Project Summary

This repository provides a PyTorch implementation of relation extraction with BERT and its variants (ALBERT, BioBERT), based on the "Matching the Blanks" (MTB) methodology. It targets NLP researchers and practitioners who want to leverage pre-trained language models to identify relationships between entities in text. The main benefit is improved relation extraction through distributional similarity learned during the MTB pre-training task.

How It Works

The core approach is a two-stage process: pre-training and fine-tuning. During pre-training, spaCy is used to detect entities in continuous text and construct relation statements around entity pairs. Entity mentions are randomly replaced with [BLANK] tokens, and the model learns to judge whether two relation statements refer to the same entity pair, so it cannot rely on surface forms alone; this is what captures distributional similarity between entity pairs. For fine-tuning, the pre-trained models are adapted to specific relation extraction datasets such as SemEval2010 Task 8 and FewRel.
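
To make the relation-statement construction concrete, below is a minimal Python sketch assuming spaCy's en_core_web_lg model. It is illustrative only, not the repository's preprocessing code; the 0.7 blanking probability mirrors the value reported in the MTB paper, and pairing only the first two entities per sentence is a simplification.

    import random
    import spacy

    nlp = spacy.load("en_core_web_lg")

    def relation_statements(text, blank_prob=0.7):
        """Build MTB-style relation statements: mark an entity pair with
        [E1]/[E2] spans and randomly replace mentions with [BLANK]."""
        statements = []
        for sent in nlp(text).sents:
            ents = list(sent.ents)
            if len(ents) < 2:
                continue
            e1, e2 = ents[0], ents[1]  # simplistic pairing for illustration
            e1_text = "[BLANK]" if random.random() < blank_prob else e1.text
            e2_text = "[BLANK]" if random.random() < blank_prob else e2.text
            offset = sent.start_char  # convert doc offsets to sentence offsets
            marked = (
                sent.text[: e1.start_char - offset]
                + "[E1]" + e1_text + "[/E1]"
                + sent.text[e1.end_char - offset : e2.start_char - offset]
                + "[E2]" + e2_text + "[/E2]"
                + sent.text[e2.end_char - offset :]
            )
            # Key each statement by its entity pair.
            statements.append(((e1.text, e2.text), marked))
        return statements

    print(relation_statements("Barack Obama was born in Hawaii in 1961."))

During MTB training, statements that share the same entity-pair key serve as positive pairs and the rest as negatives, which is what drives the distributional similarity described above.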

Quick Start & Requirements

  • Install: python3 -m pip install -r requirements.txt
  • Prerequisites: Python 3.8+, bash, spaCy (python3 -m spacy download en_core_web_lg), pre-trained HuggingFace BERT/ALBERT models, and optionally BioBERT models downloaded to ./additional_models. A quick import check is sketched after this list.
  • Datasets: SemEval2010 Task 8 and FewRel 1.0 datasets need to be downloaded and placed in the ./data/ directory.
  • Pre-training data: A .txt file (e.g., cnn.txt) is required for pre-training.
  • Documentation: Towards Data Science article
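
As a quick sanity check that the prerequisites are in place, the following snippet (illustrative, not part of the repository) verifies that the spaCy model and a HuggingFace tokenizer load:

    # Environment sanity check (illustrative; not from the repository).
    import spacy
    import torch
    from transformers import BertTokenizer

    nlp = spacy.load("en_core_web_lg")  # fails if the spaCy model was not downloaded
    tok = BertTokenizer.from_pretrained("bert-base-uncased")  # fetches HuggingFace weights
    print("spaCy pipeline:", nlp.pipe_names)
    print("tokenizer vocab size:", tok.vocab_size)
    print("CUDA available:", torch.cuda.is_available())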

Highlighted Details

  • Supports BERT, ALBERT, and BioBERT architectures.
  • Implements the "Matching the Blanks" (MTB) pre-training strategy.
  • Achieves 72.766% accuracy on FewRel (5-way 1-shot) with BERT-large.
  • Provides inference for both manually annotated and automatically detected entities (see the annotated-input sketch below).
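
For the annotated-entity path, inputs mark the two entities with [E1]/[E2] spans. The sketch below shows how such an input would typically be fed to a fine-tuned BERT classifier; it is a hedged illustration, not the repository's API, and the ./trained_model checkpoint path is hypothetical.

    import torch
    from transformers import BertTokenizer, BertForSequenceClassification

    # Hypothetical fine-tuned checkpoint whose tokenizer already includes
    # the [E1]/[/E1]/[E2]/[/E2] marker tokens.
    tokenizer = BertTokenizer.from_pretrained("./trained_model")
    model = BertForSequenceClassification.from_pretrained("./trained_model").eval()

    sentence = "The [E1]company[/E1] fabricates plastic [E2]chairs[/E2]."
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    print("predicted relation id:", logits.argmax(dim=-1).item())

For the automatically detected path, entities are presumably found with spaCy (which the project already uses for pre-training data construction) and the markers inserted before classification.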

Maintenance & Community

This is an unofficial implementation. The author solicits sponsorships. No community links (Discord, Slack) or roadmap are provided.

Licensing & Compatibility

The repository does not declare a license. It relies on pre-trained models from HuggingFace and BioBERT, each distributed under its own license. Suitability for commercial use is not specified.

Limitations & Caveats

The README notes that the pre-training corpus used (CNN, via cnn.txt) is smaller than the Wikipedia dumps used in the original paper, which may limit performance. The BioBERT model requires manual download and placement in ./additional_models. The repository is unofficial.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 5 stars in the last 90 days
