Multi-label classification with BERT for fine-grained sentiment analysis
Top 55.7% on sourcepulse
This repository provides tools for fine-grained, multi-label sentiment analysis using BERT and TextCNN models. It targets researchers and developers working with Chinese text data, offering a complete pipeline from data preprocessing and model training to deployment. The project aims to simplify the implementation of advanced NLP techniques for sentiment analysis tasks.
How It Works
The project leverages the BERT architecture for its powerful contextual embeddings, enabling nuanced understanding of text for multi-label classification. It also includes TextCNN as a baseline, offering a more traditional CNN approach. The workflow involves data preprocessing (character or word level), optional pre-training of BERT or TextCNN on custom corpora, fine-tuning for the sentiment analysis task, and deployment for online prediction.
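To make the multi-label setup concrete, the sketch below shows the core idea: each label gets its own independent logit trained with binary cross-entropy, rather than a softmax over mutually exclusive classes. This is an illustrative PyTorch sketch, not the repository's TensorFlow code; the hidden size, label count, and random inputs are hypothetical stand-ins.

```python
import torch
import torch.nn as nn

class MultiLabelHead(nn.Module):
    """Hypothetical classification head: one independent logit per label."""
    def __init__(self, hidden_size=768, num_labels=20):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, pooled_output):
        # Raw logits; the sigmoid is applied inside the loss for numerical stability.
        return self.classifier(pooled_output)

head = MultiLabelHead()
pooled = torch.randn(4, 768)                    # stand-in for BERT's pooled [CLS] vector
targets = torch.randint(0, 2, (4, 20)).float()  # multi-hot label vectors
loss = nn.BCEWithLogitsLoss()(head(pooled), targets)
print(loss.item())
```

Because labels are scored independently, a single review can be tagged with several fine-grained sentiments at once, which is what distinguishes this setup from ordinary single-label classification.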
Quick Start & Requirements
- Download the pre-trained Chinese BERT checkpoint (`chinese_L-12_H-768_A-12.zip`) and the processed training/validation data.
- Fine-tune BERT for multi-label classification: `nohup python run_classifier_multi_labels_bert.py ...`
- Optionally pre-train on a custom corpus: `nohup python create_pretraining_data.py ...`, followed by `python run_pretraining.py`
- Train the TextCNN baseline: `python train_cnn_fine_grain.py`
- Set the required environment variables (e.g. `BERT_BASE_DIR`, `TEXT_DIR`) before running the scripts.
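Once a model is trained, online prediction in a multi-label setup typically thresholds each sigmoid probability independently, so an input can receive several labels (or none). The sketch below illustrates this; the threshold, label names, and logits are hypothetical, not values taken from the repository.

```python
import torch

def predict_labels(logits, label_names, threshold=0.5):
    # Each probability is thresholded on its own, so any subset of labels can fire.
    probs = torch.sigmoid(logits)
    return [name for name, p in zip(label_names, probs.tolist()) if p > threshold]

label_names = ["service_positive", "price_positive", "taste_negative"]  # hypothetical labels
logits = torch.tensor([2.1, -0.4, -1.3])                                # hypothetical model output
print(predict_labels(logits, label_names))  # -> ['service_positive']
```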
Highlighted Details
Data preprocessing is provided as Jupyter notebooks at both character and word granularity (`preprocess_char.ipynb`, `preprocess_word.ipynb`).
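To illustrate the difference between the two granularities, the sketch below tokenizes a Chinese sentence at character level and at word level. It uses `jieba`, a common Chinese segmenter, purely as an assumption; the notebooks may rely on a different tool.

```python
import jieba  # common Chinese word segmenter; not confirmed as the notebooks' choice

text = "这家餐厅的服务态度很好"  # "This restaurant's service attitude is very good"

char_tokens = list(text)        # character level: one token per Chinese character
word_tokens = jieba.lcut(text)  # word level: segmented words such as '餐厅' (restaurant)

print(char_tokens)
print(word_tokens)
```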
Maintenance & Community
The last commit was about six years ago and the repository is inactive, so no ongoing maintenance or community support should be expected.
Licensing & Compatibility
The project builds on google-research/bert and pengshuang/AI-Comp, suggesting Apache 2.0 or a similarly permissive license, but this requires verification before reuse.
Limitations & Caveats
The project targets Chinese text specifically and would require significant adaptation for other languages. Performance metrics for the BERT models are left as the placeholder "ADD A NUMBER HERE" in the README, so benchmark results are incomplete. Setup involves downloading large files and configuring environment variables, which can be cumbersome.