Macadam by yongzhuo

NLP tool for text classification, sequence labeling, and relation extraction

Created 5 years ago

327 stars

Top 83.6% on SourcePulse

Project Summary

Macadam is a Python NLP toolkit built on TensorFlow (Keras) and bert4keras for text classification, sequence labeling, and relation extraction. It supports many embedding models and task-specific algorithms, and is aimed at NLP researchers and practitioners.

How It Works

Macadam builds on TensorFlow/Keras and bert4keras to provide a single framework for several NLP tasks. Its embedding options range from traditional Word2Vec and FastText to transformer-based models such as BERT, ALBERT, and RoBERTa. The design is modular: users can swap network architectures (e.g., TextCNN, Bi-LSTM-CRF) and embedding types for fine-tuning or experimentation.

Quick Start & Requirements

Install via pip: pip install Macadam (or, using the Tsinghua PyPI mirror: pip install -i https://pypi.tuna.tsinghua.edu.cn/simple Macadam)
Requires TensorFlow (Keras) and bert4keras.
Supports GPU acceleration (CUDA recommended).
Data format: JSON objects per line for text classification and sequence labeling.
Official documentation and examples are available within the repository.
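
The one-JSON-object-per-line data format noted above can be sketched as follows. This is a minimal illustration, not Macadam's own loader: the field names ("text", "label", "tags"), the sample rows, and the helper functions write_jsonl and read_jsonl are all hypothetical, so check the repository's examples for the keys the toolkit actually expects.

```python
import json

# Hypothetical field names ("text", "label", "tags"); the keys Macadam
# actually expects are defined by the examples in its repository.
classification_rows = [
    {"text": "The match ended in a draw.", "label": "sports"},
    {"text": "Shares fell sharply today.", "label": "finance"},
]
labeling_rows = [
    {"text": ["Anna", "lives", "in", "Paris"],
     "tags": ["B-PER", "O", "O", "B-LOC"]},
]


def write_jsonl(path, rows):
    """Write each row as one JSON object on its own line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            # ensure_ascii=False keeps non-ASCII text (e.g. Chinese) readable.
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a file with one JSON object per line, skipping blank lines."""
    rows = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                rows.append(json.loads(line))
    return rows
```

Keeping one record per line means a large training file can be streamed line by line rather than parsed as a single JSON document. For sequence labeling, the tag list should have the same length as the token list.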

Highlighted Details

Supports a broad spectrum of embedding models including BERT, ALBERT, RoBERTa, XLNet, and GPT-2.
Offers a rich selection of text classification algorithms like FastText, TextCNN, HAN, and Capsule Networks.
Implements various sequence labeling architectures such as CRF, Bi-LSTM-CRF, and Lattice-LSTM-Batch.
Provides example usage scripts for both text classification and sequence labeling tasks.

Maintenance & Community

The project is authored by Yongzhuo Mo. Further community engagement channels or roadmap details are not explicitly mentioned in the README.

Licensing & Compatibility

The repository does not explicitly state a license. The BibTeX entry in the README identifies it only as a GitHub project, which does not settle the licensing terms. Users should verify licensing before commercial or closed-source use.

Limitations & Caveats

The README indicates that relation extraction (RE), TextGCN for text classification, and MRC for sequence labeling are still marked as TODO. The project appears to focus primarily on Chinese NLP tasks; the datasets mentioned include CLUE NER 2020 and the People's Daily corpus.

Health Check

Last Commit

2 years ago

Responsiveness

Inactive

Star History

0 stars in the last 30 days