Multi-label classification with BERT for fine-grained sentiment analysis
Top 55.7% on sourcepulse
This repository provides tools for fine-grained, multi-label sentiment analysis using BERT and TextCNN models. It targets researchers and developers working with Chinese text data, offering a complete pipeline from data preprocessing and model training to deployment. The project aims to simplify the implementation of advanced NLP techniques for sentiment analysis tasks.
How It Works
The project leverages the BERT architecture for its powerful contextual embeddings, enabling nuanced understanding of text for multi-label classification. It also includes TextCNN as a baseline, offering a more traditional CNN approach. The workflow involves data preprocessing (character or word level), optional pre-training of BERT or TextCNN on custom corpora, fine-tuning for the sentiment analysis task, and deployment for online prediction.
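To make the multi-label setup concrete, the sketch below shows the core idea: each label gets its own independent logit trained with binary cross-entropy, rather than a softmax over mutually exclusive classes. This is an illustrative PyTorch sketch, not the repository's TensorFlow code; the hidden size, label count, and random inputs are hypothetical stand-ins.

```python
import torch
import torch.nn as nn

class MultiLabelHead(nn.Module):
    """Hypothetical classification head: one independent logit per label."""
    def __init__(self, hidden_size=768, num_labels=20):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, pooled_output):
        # Raw logits; the sigmoid is applied inside the loss for numerical stability.
        return self.classifier(pooled_output)

head = MultiLabelHead()
pooled = torch.randn(4, 768)                    # stand-in for BERT's pooled [CLS] vector
targets = torch.randint(0, 2, (4, 20)).float()  # multi-hot label vectors
loss = nn.BCEWithLogitsLoss()(head(pooled), targets)
print(loss.item())
```

Because labels are scored independently, a single review can be tagged with several fine-grained sentiments at once, which is what distinguishes this setup from ordinary single-label classification.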
Quick Start & Requirements
- Download the pre-trained Chinese BERT checkpoint (`chinese_L-12_H-768_A-12.zip`) and the processed training/validation data.
- Fine-tune BERT for multi-label classification: `nohup python run_classifier_multi_labels_bert.py ...`
- Optionally pre-train on a custom corpus: `nohup python create_pretraining_data.py ...`, followed by `python run_pretraining.py`
- Train the TextCNN baseline: `python train_cnn_fine_grain.py`
- Set the required environment variables (e.g. `BERT_BASE_DIR`, `TEXT_DIR`) before running the scripts.
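Once a model is trained, online prediction in a multi-label setup typically thresholds each sigmoid probability independently, so an input can receive several labels (or none). The sketch below illustrates this; the threshold, label names, and logits are hypothetical, not values taken from the repository.

```python
import torch

def predict_labels(logits, label_names, threshold=0.5):
    # Each probability is thresholded on its own, so any subset of labels can fire.
    probs = torch.sigmoid(logits)
    return [name for name, p in zip(label_names, probs.tolist()) if p > threshold]

label_names = ["service_positive", "price_positive", "taste_negative"]  # hypothetical labels
logits = torch.tensor([2.1, -0.4, -1.3])                                # hypothetical model output
print(predict_labels(logits, label_names))  # -> ['service_positive']
```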
Highlighted Details
Data preprocessing is provided as Jupyter notebooks at both character and word granularity (`preprocess_char.ipynb`, `preprocess_word.ipynb`).
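To illustrate the difference between the two granularities, the sketch below tokenizes a Chinese sentence at character level and at word level. It uses `jieba`, a common Chinese segmenter, purely as an assumption; the notebooks may rely on a different tool.

```python
import jieba  # common Chinese word segmenter; not confirmed as the notebooks' choice

text = "这家餐厅的服务态度很好"  # "This restaurant's service attitude is very good"

char_tokens = list(text)        # character level: one token per Chinese character
word_tokens = jieba.lcut(text)  # word level: segmented words such as '餐厅' (restaurant)

print(char_tokens)
print(word_tokens)
```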
Maintenance & Community
The last commit was about six years ago and the repository is inactive, so no ongoing maintenance or community support should be expected.
Licensing & Compatibility
The project builds on google-research/bert and pengshuang/AI-Comp, suggesting Apache 2.0 or a similarly permissive license, but this requires verification before reuse.
Limitations & Caveats
The project targets Chinese text specifically and would require significant adaptation for other languages. Performance metrics for the BERT models are left as the placeholder "ADD A NUMBER HERE" in the README, so benchmark results are incomplete. Setup involves downloading large files and configuring environment variables, which can be cumbersome.