NER-BERT-pytorch  by lemonhu

PyTorch solution for named entity recognition

created 6 years ago
448 stars

Top 68.1% on sourcepulse

Project Summary

This repository provides a PyTorch implementation of Named Entity Recognition (NER) built on Google's BERT model and demonstrated primarily on Chinese text. It targets Natural Language Processing (NLP) researchers and practitioners who need to identify entities such as Person, Organization, and Location. The project shows how to fine-tune BERT on a custom NER dataset and reports a 94.62% F1 score on the MSRA test set.

How It Works

The project fine-tunes BERT's contextual embeddings for sequence labeling. Text is annotated with the BIO scheme: each token is tagged as the Beginning of an entity, Inside an entity, or Outside any entity, and the B and I tags also carry the entity type (for example, B-PER). Model loading and management use the pytorch-pretrained-bert library, and fine-tuning runs on the provided MSRA dataset.
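To make the BIO scheme concrete, here is a minimal sketch of converting entity spans to tags and back. The function names, the example sentence, and the character-level tokenization are illustrative assumptions and are not taken from the repository's code.

```python
from typing import List, Optional, Tuple

# (entity_type, start_index, end_index_exclusive); a hypothetical span format
Span = Tuple[str, int, int]


def spans_to_bio(n_tokens: int, spans: List[Span]) -> List[str]:
    """Turn entity spans into one BIO tag per token."""
    tags = ["O"] * n_tokens
    for ent_type, start, end in spans:
        tags[start] = f"B-{ent_type}"
        for i in range(start + 1, end):
            tags[i] = f"I-{ent_type}"
    return tags


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Recover entity spans from BIO tags; orphan I- tags are ignored."""
    spans: List[Span] = []
    start: Optional[int] = None
    ent_type: Optional[str] = None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        # An open span continues only on an I- tag of the same type.
        continues = tag.startswith("I-") and tag[2:] == ent_type
        if start is not None and not continues:
            spans.append((ent_type, start, i))
            start, ent_type = None, None
        if tag.startswith("B-"):
            start, ent_type = i, tag[2:]
    return spans


# Made-up example sentence, split per character for illustration.
tokens = list("李明在北京工作")
tags = spans_to_bio(len(tokens), [("PER", 0, 2), ("LOC", 3, 5)])
for token, tag in zip(tokens, tags):
    print(token, tag)
```

The round trip through `bio_to_spans` is what lets predicted tag sequences be compared against gold annotations at the entity level rather than token by token.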

Quick Start & Requirements

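  • Install via pip, quoting the version specifiers so the shell does not treat > as output redirection: pip install "tensorflow>=1.11.0" "torch>=0.4.1" "pytorch-pretrained-bert==0.4.0" tqdm. Install apex separately from the NVIDIA repository listed below.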
  • Requires Python 3.5+ and PyTorch 0.4.1/1.0.0.
  • TensorFlow is only needed for converting pre-trained models.
  • apex is recommended for mixed-precision and distributed training.
  • Download pre-trained BERT base Chinese model or convert from TensorFlow checkpoint.
  • Run python build_msra_dataset_tags.py to prepare the dataset.
  • Train using python train.py or python train.py --data_dir <path> --bert_model_dir <path> --model_dir <path>.
  • Evaluate using python evaluate.py.
  • Official BERT Chinese model: https://github.com/google-research/bert
  • pytorch-pretrained-bert: https://github.com/huggingface/pytorch-pretrained-BERT
  • apex: https://github.com/NVIDIA/apex

Highlighted Details

  • Achieved 94.62% F1 score on the MSRA test set without extensive hyperparameter tuning.
  • Reports per-entity-type results (PER: 96.39%, ORG: 90.84%, LOC: 95.52%).
  • Supports both Chinese and English NER tasks.
  • Provides clear instructions for converting TensorFlow BERT checkpoints to PyTorch.
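
The summary does not say how the F1 scores above are computed. A common convention for NER is entity-level exact match, where a predicted entity counts only if its type and both boundaries agree with the gold annotation. The sketch below assumes that convention; the function names and the toy tag sequences are hypothetical, and the repository's evaluate.py may compute its metric differently.

```python
from typing import List, Set, Tuple

Entity = Tuple[str, int, int]  # (entity_type, start, end_exclusive)


def entities(tags: List[str]) -> Set[Entity]:
    """Collect the entities encoded by a BIO tag sequence."""
    found: Set[Entity] = set()
    start, kind = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        if start is not None and not (tag.startswith("I-") and tag[2:] == kind):
            found.add((kind, start, i))
            start, kind = None, None
        if tag.startswith("B-"):
            start, kind = i, tag[2:]
    return found


def entity_f1(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under exact-match entity scoring."""
    gold_set, pred_set = entities(gold), entities(pred)
    true_pos = len(gold_set & pred_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data: the prediction truncates the LOC entity, so it does not count.
gold = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "O", "B-ORG"]
pred = ["B-PER", "I-PER", "O", "B-LOC", "O", "O", "B-ORG"]
precision, recall, f1 = entity_f1(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

In this toy case two of the three predicted entities match exactly, so precision, recall, and F1 all equal 2/3, even though six of the seven individual tags are correct.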

Maintenance & Community

The project is a personal implementation by lemonhu. There are no explicit mentions of active maintenance, community channels (like Discord/Slack), or a public roadmap.

Licensing & Compatibility

The repository does not explicitly state a license. However, it relies on libraries with permissive licenses (PyTorch, TensorFlow, Hugging Face's pytorch-pretrained-bert). Commercial use would require careful verification of any implicit licensing or dependencies.

Limitations & Caveats

The project specifies compatibility with older versions of PyTorch (0.4.1/1.0.0) and Python 3.5, which may pose challenges for integration with modern ML stacks. The lack of explicit licensing information is a significant caveat for commercial adoption.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 3 stars in the last 90 days
