PyTorch solution for named entity recognition
This repository provides a PyTorch implementation for Named Entity Recognition (NER) using Google's BERT model, specifically tailored for Chinese text. It targets researchers and practitioners in Natural Language Processing (NLP) who need a robust solution for identifying entities like Person, Organization, and Location in text. The project offers a clear path to fine-tuning BERT on custom NER datasets, demonstrating strong performance on the MSRA dataset.
How It Works
The project fine-tunes BERT on a sequence labeling task, using its contextual embeddings as token representations. Text is labeled with a BIO tagging scheme: each token is marked as the Beginning of an entity, Inside an entity, or Outside any entity, and the B and I tags also carry the entity type. The implementation uses the pytorch-pretrained-bert library to load and manage the model, and fine-tunes it on the provided MSRA dataset.
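The BIO scheme described above can be illustrated with a small, dependency-free sketch. It is not taken from the repository: the function names, the (start, end, type) span format, the example sentence, and the exact tag strings (B-PER, I-LOC, and so on) are assumptions chosen for illustration, and the sentence is split per character, which is roughly how Chinese text is tokenized for BERT.

```python
from typing import List, Tuple

# (start index, end index exclusive, entity type); a hypothetical span format
Span = Tuple[int, int, str]


def spans_to_bio(num_tokens: int, spans: List[Span]) -> List[str]:
    """Turn entity spans into one BIO tag per token."""
    tags = ["O"] * num_tokens
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"          # remaining tokens of the entity
    return tags


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Recover entity spans from BIO tags (stray I- tags are ignored)."""
    spans: List[Span] = []
    start, etype = None, None
    # A trailing "O" sentinel flushes an entity that ends the sequence.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and tag[2:] == etype
        if start is not None and not continues:
            spans.append((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


# Illustrative sentence: "李华在北京工作" (Li Hua works in Beijing).
tokens = list("李华在北京工作")
gold_spans = [(0, 2, "PER"), (3, 5, "LOC")]
tags = spans_to_bio(len(tokens), gold_spans)

for token, tag in zip(tokens, tags):
    print(token, tag)
```

Decoding predicted tags back into spans with bio_to_spans is one common way to score NER at the entity level rather than per token; whether the repository's evaluate.py does exactly this is not stated in the source.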
Quick Start & Requirements
Install dependencies: pip install "tensorflow>=1.11.0" "torch>=0.4.1" pytorch-pretrained-bert==0.4.0 tqdm apex (quote the version specifiers so the shell does not treat > as a redirect)
apex is recommended for mixed-precision and distributed training.
Prepare the dataset: python build_msra_dataset_tags.py
Train: python train.py, or python train.py --data_dir <path> --bert_model_dir <path> --model_dir <path>
Evaluate: python evaluate.py
Links:
pytorch-pretrained-bert: https://github.com/huggingface/pytorch-pretrained-BERT
apex: https://github.com/NVIDIA/apex
Highlighted Details
Maintenance & Community
The project is a personal implementation by lemonhu. There are no explicit mentions of active maintenance, community channels (like Discord/Slack), or a public roadmap.
Licensing & Compatibility
The repository does not explicitly state a license. However, it relies on libraries with permissive licenses (PyTorch, TensorFlow, Hugging Face's pytorch-pretrained-bert). Commercial use would require careful verification of any implicit licensing or dependencies.
Limitations & Caveats
The project specifies compatibility with older versions of PyTorch (0.4.1/1.0.0) and Python 3.5, which may pose challenges for integration with modern ML stacks. The lack of explicit licensing information is a significant caveat for commercial adoption.