pytorch_neural_crf  by allanj

NER tool using LSTM/BERT-CRF, achieving SOTA performance

created 6 years ago
379 stars

Top 76.2% on sourcepulse

Project Summary

This repository provides a PyTorch implementation of LSTM-CRF and BERT-CRF models for Named Entity Recognition (NER) and other sequence labeling tasks. It targets researchers and practitioners who want state-of-the-art results on standard NER datasets, with efficient training and inference.

How It Works

The model places a Conditional Random Field (CRF) layer on top of a BiLSTM or a fine-tuned BERT/RoBERTa encoder. The encoder supplies a contextual representation for each token, and the CRF scores transitions between adjacent labels so that the predicted label sequence is consistent as a whole. A "Faster CRF" module provides O(log N) inference and backtracking (N is presumably the sequence length), which speeds up decoding. The project also supports distributed training through HuggingFace's accelerate library.
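
To make the role of the CRF layer concrete, here is a minimal sketch of standard sequential Viterbi decoding for a linear-chain CRF in plain Python. It is not the repository's code: the function name `viterbi_decode` and the list-of-lists score layout are assumptions made for illustration, and the emission scores would normally come from the BiLSTM or BERT encoder.

```python
from typing import List, Sequence, Tuple


def viterbi_decode(
    emissions: Sequence[Sequence[float]],
    transitions: Sequence[Sequence[float]],
) -> Tuple[float, List[int]]:
    """Return (best_score, best_path) for a linear-chain CRF.

    emissions[t][j]   : score of label j at position t (hypothetical layout)
    transitions[i][j] : score of moving from label i to label j
    """
    n_labels = len(transitions)
    score = list(emissions[0])  # best score of any path ending in each label
    backpointers = []           # backpointers[t-1][j] = best previous label

    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for j in range(n_labels):
            # Pick the previous label that maximizes the running score.
            best_prev = max(
                range(n_labels),
                key=lambda i: score[i] + transitions[i][j],
            )
            new_score.append(
                score[best_prev] + transitions[best_prev][j] + emissions[t][j]
            )
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)

    # Backtrack from the best final label to recover the full path.
    last = max(range(n_labels), key=lambda j: score[j])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return score[last], path
```

This version takes one step per token, so its running time grows linearly with the sequence length. The transition scores are what let the CRF override a locally attractive label when it would make the sequence as a whole less plausible.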

Quick Start & Requirements

  • Install: pip install transformers datasets accelerate seqeval
  • Prerequisites: Python >= 3.6, PyTorch >= 1.6.0, CUDA (for GPU acceleration).
  • Usage: Fine-tune BERT/RoBERTa by setting embedder_type (e.g., roberta-base) in transformers_trainer.py. For distributed training, use accelerate launch transformers_trainer_ddp.py.
  • Docs: README

Highlighted Details

  • Achieves SOTA performance on CoNLL-2003 and OntoNotes 5.0 datasets with BERT-base-cased and RoBERTa-base.
  • Implements a faster CRF module for O(log N) inference.
  • Supports distributed training using HuggingFace accelerate.
  • Offers flexibility to use BERT/RoBERTa as contextualized embeddings or fine-tune them directly.
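
One way a CRF decoder can reach logarithmic depth is to express each decoding step as a max-plus matrix product and combine the products pairwise in a tree, since that product is associative. The sketch below illustrates this general idea in plain Python; whether the repository's faster CRF module works exactly this way is an assumption, and the names `maxplus_matmul`, `tree_reduce`, and `best_score_tree` are hypothetical. It computes only the best score and omits backtracking.

```python
def maxplus_matmul(a, b):
    """Max-plus product: result[i][j] = max over x of a[i][x] + b[x][j]."""
    inner, cols = len(b), len(b[0])
    return [
        [max(row[x] + b[x][j] for x in range(inner)) for j in range(cols)]
        for row in a
    ]


def tree_reduce(mats):
    """Combine adjacent matrices pairwise; return (product, number of levels).

    The products within one level are independent of each other, so on
    parallel hardware each level could run as a single batched step.
    """
    levels = 0
    while len(mats) > 1:
        paired = [
            maxplus_matmul(mats[i], mats[i + 1])
            for i in range(0, len(mats) - 1, 2)
        ]
        if len(mats) % 2 == 1:
            paired.append(mats[-1])  # odd one out moves up unchanged
        mats = paired
        levels += 1
    return mats[0], levels


def best_score_tree(emissions, transitions):
    """Best label-sequence score via tree reduction (no backtracking)."""
    n_labels = len(transitions)
    start = [list(emissions[0])]  # 1 x L row vector of initial scores
    steps = [
        [
            [transitions[i][j] + emissions[t][j] for j in range(n_labels)]
            for i in range(n_labels)
        ]
        for t in range(1, len(emissions))
    ]
    product, levels = tree_reduce([start] + steps)
    return max(product[0]), levels
```

For a sequence of N tokens the reduction needs about log2(N) levels. The trade-off is extra total work, because each combination is a full matrix product over the label set rather than a vector update.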

Maintenance & Community

The README notes recent updates and lists planned features such as pre-trained model releases and Semi-CRF support. It mentions no community channels (Discord/Slack).

Licensing & Compatibility

The README does not explicitly state a license. Users should verify licensing for commercial use or integration into closed-source projects.

Limitations & Caveats

The project requires Python >= 3.6 and PyTorch >= 1.6.0. Tokenization may need adjustment when using a HuggingFace model other than the defaults. Pre-trained models are not yet released.
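
The tokenization caveat comes from a general mismatch: NER labels are given per word, while BERT-style models operate on subword pieces. The sketch below shows one common convention, where the label stays on the first piece of each word and the remaining pieces get a placeholder. This is not necessarily what the repository does; `toy_subword_tokenize` is a hypothetical stand-in for a real tokenizer, and the `"X"` placeholder label is an assumption.

```python
def toy_subword_tokenize(word, max_piece_len=4):
    """Hypothetical tokenizer: split a word into fixed-size chunks.

    A real subword tokenizer uses a learned vocabulary; this stand-in only
    reproduces the fact that one word can become several pieces.
    """
    pieces = [
        word[i:i + max_piece_len] for i in range(0, len(word), max_piece_len)
    ]
    return [pieces[0]] + ["##" + p for p in pieces[1:]]


def align_labels(words, labels, tokenize=toy_subword_tokenize, ignore_label="X"):
    """Map word-level labels onto subword pieces.

    Returns (tokens, aligned_labels, first_index), where first_index[k] is
    the position in tokens of the first piece of words[k].
    """
    tokens, aligned, first_index = [], [], []
    for word, label in zip(words, labels):
        pieces = tokenize(word)
        first_index.append(len(tokens))
        tokens.extend(pieces)
        # Only the first piece carries the real label.
        aligned.extend([label] + [ignore_label] * (len(pieces) - 1))
    return tokens, aligned, first_index


tokens, aligned, first_index = align_labels(
    ["Barack", "Obama", "spoke"], ["B-PER", "I-PER", "O"]
)
```

Different tokenizers mark word boundaries differently, so the step that finds the first piece of each word is the part most likely to need changes when swapping models.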

Health Check

  • Last commit: 2 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 9 stars in the last 90 days
