sentiment_analysis_fine_grain by brightmart

Multi-label classification with BERT for fine-grained sentiment analysis

created 6 years ago
592 stars

Top 55.7% on sourcepulse

Project Summary

This repository provides tools for fine-grained, multi-label sentiment analysis using BERT and TextCNN models. It targets researchers and developers working with Chinese text data, offering a complete pipeline from data preprocessing and model training to deployment. The project aims to simplify the implementation of advanced NLP techniques for sentiment analysis tasks.

How It Works

The project leverages the BERT architecture for its contextual embeddings, which enable a nuanced, per-label reading of the text for multi-label classification. It also includes TextCNN as a more traditional convolutional baseline. The workflow runs through data preprocessing (character or word level), optional pre-training of BERT or TextCNN on custom corpora, fine-tuning on the sentiment analysis task, and deployment for online prediction.
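
For intuition, here is a minimal sketch of the multi-label formulation in PyTorch (an illustrative assumption; the repository itself builds on the TensorFlow BERT codebase): each label gets an independent sigmoid decision trained with binary cross-entropy, rather than a single softmax over mutually exclusive classes.

    import torch
    import torch.nn as nn

    class MultiLabelHead(nn.Module):
        """Sigmoid classification head over a pooled sentence embedding."""
        def __init__(self, hidden_size=768, num_labels=80):
            super().__init__()
            self.classifier = nn.Linear(hidden_size, num_labels)

        def forward(self, pooled_output):
            return self.classifier(pooled_output)  # raw logits, one per label

    head = MultiLabelHead()
    pooled = torch.randn(4, 768)                    # stand-in for BERT [CLS] vectors
    targets = torch.randint(0, 2, (4, 80)).float()  # several labels may be 1 at once
    loss = nn.BCEWithLogitsLoss()(head(pooled), targets)
    predicted = torch.sigmoid(head(pooled)) > 0.5   # independent per-label decisions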

Quick Start & Requirements

  • BERT: Requires the pre-trained Chinese BERT checkpoint (chinese_L-12_H-768_A-12.zip) and processed training/validation data.
  • Installation: Primarily uses Python scripts. Environment setup involves downloading specific model checkpoints and data.
  • Commands:
    • BERT Classification: nohup python run_classifier_multi_labels_bert.py ...
    • BERT Pre-training: nohup python create_pretraining_data.py ... followed by python run_pretraining.py
    • TextCNN Training: python train_cnn_fine_grain.py
  • Resources: Requires downloading large pre-trained models and datasets. Specific paths must be set via the BERT_BASE_DIR and TEXT_DIR environment variables; a minimal setup sketch follows this list.
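
As a rough illustration of the setup above, the sketch below sets the two environment variables and launches fine-tuning from Python. The paths are placeholders, and the CLI flags are elided here just as they are in the commands above, so consult the repository README for the full invocation.

    import os
    import subprocess

    # Placeholder paths; point these at the downloaded checkpoint and data.
    os.environ["BERT_BASE_DIR"] = "/path/to/chinese_L-12_H-768_A-12"
    os.environ["TEXT_DIR"] = "/path/to/processed_data"

    # Flags are intentionally omitted (the summary above elides them too);
    # see the repository README for the full command line.
    subprocess.run(["python", "run_classifier_multi_labels_bert.py"], check=True)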

Highlighted Details

  • Supports multi-label classification, where a single review can carry several sentiment labels at once (see the encoding sketch after this list).
  • Offers a complete pipeline including data preprocessing scripts (preprocess_char.ipynb, preprocess_word.ipynb).
  • Includes instructions for pre-training BERT and TextCNN on custom data for improved performance.
  • Provides guidance for deploying BERT models for online prediction.
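
For context, the associated AI Challenger 2018 fine-grained task annotates each review with 20 aspects, each taking one of four polarity values (not mentioned, negative, neutral, positive). A common way to cast this as multi-label classification is to flatten the labels into an 80-dimensional multi-hot vector; a minimal sketch follows, though the repo's preprocessing notebooks may differ in detail.

    import numpy as np

    NUM_ASPECTS = 20              # AI Challenger 2018 defines 20 fine-grained aspects
    POLARITIES = [-2, -1, 0, 1]   # not mentioned / negative / neutral / positive

    def to_multi_hot(aspect_labels):
        """Flatten 20 per-aspect polarity labels into an 80-dim indicator vector."""
        vec = np.zeros(NUM_ASPECTS * len(POLARITIES), dtype=np.float32)
        for i, label in enumerate(aspect_labels):
            vec[i * len(POLARITIES) + POLARITIES.index(label)] = 1.0
        return vec

    example = to_multi_hot([0] * 19 + [1])
    assert example.sum() == NUM_ASPECTS  # exactly one polarity fires per aspect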

Maintenance & Community

  • The repository is associated with AI Challenger 2018.
  • Key references include the original BERT paper and implementations. No explicit community channels (Discord/Slack) or active maintenance signals are present in the README.

Licensing & Compatibility

  • The README does not explicitly state a license. The project references google-research/bert and pengshuang/AI-Comp, suggesting potential Apache 2.0 or similar permissive licenses, but this requires verification.

Limitations & Caveats

The project targets Chinese text specifically and may require significant adaptation for other languages. BERT performance figures in the README are left as the placeholder "ADD A NUMBER HERE", so benchmark results are incomplete. Setup also involves downloading large files and setting environment variables, which can be cumbersome.

Health Check

  • Last commit: 6 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 1 star in the last 90 days

