Text_Classification  by kk7nc

Survey paper for text classification algorithms

created 7 years ago
1,819 stars

Top 24.3% on sourcepulse

Project Summary

This repository accompanies a survey of text classification algorithms, covering both traditional machine learning methods and modern deep learning architectures. It offers code examples and comparative analyses for researchers and practitioners who want to understand and implement these techniques.

How It Works

The project works through the text classification pipeline in order: text preprocessing, feature extraction (TF-IDF, Word2Vec, GloVe, ELMo, FastText), and dimensionality reduction (PCA, linear discriminant analysis (LDA), NMF, Random Projection, Autoencoders, t-SNE). It then covers the classification algorithms themselves: Rocchio, Boosting and Bagging, Naive Bayes, k-NN, SVM, Decision Trees, Random Forests, CRFs, DNNs, RNNs (GRU, LSTM), CNNs, RCNNs, and Hierarchical Attention Networks. Each section includes a theoretical explanation, code snippets, and performance evaluations.
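As a rough illustration of how the first and last stages of that pipeline fit together, the sketch below pairs TF-IDF feature extraction with a Rocchio-style nearest-centroid classifier in plain Python. It is not code from the repository: the toy sentences, the whitespace tokenizer, the particular IDF smoothing, and all function names are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive preprocessing step)."""
    return text.lower().split()


def fit_idf(tokenized_docs):
    """Smoothed inverse document frequency for each training term (one common variant)."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """L2-normalised TF-IDF vector stored as a sparse dict; unseen terms are dropped."""
    counts = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def fit_centroids(vectors, labels):
    """Rocchio-style training: average the TF-IDF vectors belonging to each class."""
    sums = defaultdict(lambda: defaultdict(float))
    sizes = Counter(labels)
    for vec, label in zip(vectors, labels):
        for term, weight in vec.items():
            sums[label][term] += weight
    return {
        label: {term: w / sizes[label] for term, w in terms.items()}
        for label, terms in sums.items()
    }


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict(text, idf, centroids):
    """Assign the class whose centroid is most similar to the document vector."""
    vec = tfidf_vector(tokenize(text), idf)
    return max(sorted(centroids), key=lambda label: cosine(vec, centroids[label]))


# Hypothetical toy corpus, invented for this example.
train_texts = [
    "great movie loved the acting",
    "wonderful film great story",
    "terrible movie boring plot",
    "awful film bad acting",
]
train_labels = ["pos", "pos", "neg", "neg"]

tokenized = [tokenize(t) for t in train_texts]
idf = fit_idf(tokenized)
centroids = fit_centroids([tfidf_vector(d, idf) for d in tokenized], train_labels)

print(predict("loved the great story", idf, centroids))
print(predict("boring and bad plot", idf, centroids))
```

In practice the same two steps would more likely be built from library components such as a TF-IDF vectorizer and a nearest-centroid classifier; the hand-rolled version is only meant to make the data flow visible.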

Quick Start & Requirements

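  • Installation: pip install RMDL, or git clone --recursive https://github.com/kk7nc/RMDL.git followed by pip install -r requirements.txt. These commands target the RMDL package (kk7nc/RMDL), which is a separate repository from Text_Classification.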
  • Prerequisites: Python 3, TensorFlow.
  • Data: The repository utilizes standard datasets like IMDB, Reuters-21578, 20Newsgroups, and custom Web of Science datasets.

Highlighted Details

  • Comprehensive coverage of both classical and state-of-the-art NLP classification methods.
  • Includes detailed code examples for implementing various models using scikit-learn, Keras, and TensorFlow.
  • Provides comparative analysis and performance metrics (F1 score, MCC, ROC/AUC) for different algorithms.
  • Explores advanced deep learning architectures like RCNN, HAN, and ensemble methods like RMDL.
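To make two of the metrics named above concrete, here is a minimal sketch that computes F1 and the Matthews correlation coefficient (MCC) for a binary classifier from its confusion counts. The labels and predictions are made up for the example and the function names are assumptions, not the repository's API. ROC/AUC is left out because it needs predicted scores rather than hard labels.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall: 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; defined as 0 when any marginal is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical labels and predictions, invented for this example.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print("F1 :", f1_score(tp, fp, fn))    # 0.75
print("MCC:", mcc(tp, fp, tn, fn))     # 0.5
```

Unlike F1, MCC also uses the true-negative count, which is one reason it is often reported alongside F1 on imbalanced datasets.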

Maintenance & Community

The project is associated with the paper "Text Classification Algorithms: A Survey" published in the journal Information. Links to the paper, arXiv, and related resources are provided. The repository structure suggests active development and research contributions.

Licensing & Compatibility

The repository includes a LICENSE file in its root directory, so it is likely released under an open-source license. The README does not state the license terms prominently; check the LICENSE file before reuse.

Limitations & Caveats

The README focuses heavily on presenting a broad overview and code examples, with less emphasis on practical setup for specific use cases or detailed error handling. Some code snippets reference local file paths (e.g., for GloVe embeddings) that may require adjustment. The sheer volume of covered algorithms might lead to a steep learning curve for newcomers.

Health Check

  • Last commit: 4 months ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 12 stars in the last 90 days
