bert-utils by terrifyzhao

BERT utility for sentence embeddings, text classification, and similarity

Created 7 years ago

1,672 stars

Top 25.1% on SourcePulse

Project Summary

This repository provides simplified utilities for leveraging Google's BERT model for generating sentence embeddings and performing text classification. It is designed for developers and researchers looking for a streamlined way to integrate BERT's capabilities into their NLP pipelines, offering faster sentence vector generation and a straightforward fine-tuning process for classification tasks.

How It Works

The library builds on Google's open-source BERT implementation and focuses on ease of use. For sentence embeddings, it caches the generated graph file so that later startups are faster. For text classification, it supports fine-tuning through TensorFlow's Estimator API and expects the data to be split into three files: train.csv, dev.csv, and test.csv.
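As a rough illustration of the three-file data layout, the sketch below shuffles labeled sentence pairs and writes them to train.csv, dev.csv, and test.csv using only the Python standard library. The function name `write_splits`, the split fractions, and the column layout (sentence1, sentence2, label) are assumptions made for this example; the summary does not document the columns the repository's loader expects, so check that before reusing the layout.

```python
import csv
import random
from pathlib import Path


def write_splits(pairs, out_dir, dev_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle labeled sentence pairs and write train.csv, dev.csv, test.csv.

    ASSUMPTION: the (sentence1, sentence2, label) columns and the header row
    are illustrative only, not the repository's confirmed format.
    """
    rows = list(pairs)
    random.Random(seed).shuffle(rows)  # seeded so the split is reproducible

    n_dev = int(len(rows) * dev_frac)
    n_test = int(len(rows) * test_frac)
    splits = {
        "dev.csv": rows[:n_dev],
        "test.csv": rows[n_dev:n_dev + n_test],
        "train.csv": rows[n_dev + n_test:],
    }

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, split_rows in splits.items():
        with open(out / name, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["sentence1", "sentence2", "label"])
            writer.writerows(split_rows)

    # Report how many rows landed in each file.
    return {name: len(split_rows) for name, split_rows in splits.items()}
```

Calling `write_splits(pairs, "data")` with a list of `(text_a, text_b, label)` tuples would create the three files under a hypothetical `data` directory.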

Quick Start & Requirements

Install: The README gives no explicit install steps; its usage examples imply cloning the repository and running its Python scripts directly.
Prerequisites:
- Download BERT Chinese model: chinese_L-12_H-768_A-12.zip from Google Storage.
- TensorFlow.
- Python.
Usage:
- Sentence Embeddings: from bert.extrac_feature import BertVector; bv = BertVector(); bv.encode(['text'])
- Text Classification: from similarity import BertSim; bs = BertSim(); bs.set_mode(...); then call bs.train(), bs.eval(), or bs.test()
Data: Includes QA_corpus dataset for text matching.
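Once sentence vectors are available (for example, from `bv.encode` above), a common way to compare two sentences is cosine similarity. The sketch below shows that generic technique in plain Python. It is not taken from the repository and is not a description of how `BertSim` works; the helper name `cosine_similarity` and the short hand-written vectors are stand-ins for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors.

    Returns 0.0 when either vector has zero length, since the angle is
    undefined in that case.
    """
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hand-written stand-ins for two sentence vectors (real ones are much longer).
vec_a = [0.2, 0.1, 0.7]
vec_b = [0.2, 0.1, 0.6]
score = cosine_similarity(vec_a, vec_b)
```

Scores close to 1 indicate vectors pointing in nearly the same direction; how well that tracks semantic similarity depends on the model and pooling used.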

Highlighted Details

- Optimizes graph file generation for sentence vectors, giving faster startup.
- Fixes a bug in generating sentence vectors from concurrent processes.
- Uses the QA_corpus dataset for text matching, which the README describes as more authoritative.
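The startup optimization can be pictured as a build-once, reuse-afterwards cache. The sketch below shows that general pattern with the standard library only. It is an assumption-laden illustration, not the repository's code: `load_or_build`, the `graph_<hash>.bin` file name, and the `build_fn` callback are all hypothetical, and the per-process temporary file is one possible way to keep concurrent processes from reading a half-written file.

```python
import hashlib
import json
import os
from pathlib import Path


def load_or_build(cache_dir, config, build_fn):
    """Return the path of a cached artifact, building it only on a cache miss.

    ASSUMPTION: build_fn(config) returns the serialized artifact as bytes.
    """
    # Key the cache on the configuration so different settings get different files.
    digest = hashlib.sha256(
        json.dumps(config, sort_keys=True).encode("utf-8")
    ).hexdigest()[:16]
    path = Path(cache_dir) / f"graph_{digest}.bin"

    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        # Write to a per-process temporary file, then rename it into place,
        # so other processes never observe a partially written cache file.
        tmp = path.with_name(f"{path.name}.{os.getpid()}.tmp")
        tmp.write_bytes(build_fn(config))
        tmp.replace(path)
    return path
```

With this pattern, the expensive build runs on the first call for a given configuration, and later calls only check that the file exists.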

Maintenance & Community

Last updated July 1st, 2019. No community links or active maintenance signals are present in the README.

Licensing & Compatibility

The README does not specify a license. The project is based on Google's BERT code, which is released under the Apache 2.0 license, but whether that license also applies to this utility wrapper is not confirmed.

Limitations & Caveats

The project was last updated in 2019, so it likely does not incorporate later BERT advancements and may not work unmodified with newer TensorFlow/Keras APIs. Because no license is stated, the terms for commercial use are unclear.

Health Check

Last Commit: 6 years ago

Responsiveness: 1 week

Star History

3 stars in the last 30 days