parsbert by hooshvare

Persian language model based on Google's BERT architecture

Created 5 years ago
397 stars

Top 72.6% on SourcePulse

View on GitHub
Project Summary

ParsBERT is a Transformer-based language model designed specifically for Persian natural language understanding. It offers a pre-trained base model along with fine-tuned variants for sentiment analysis, text classification, and named entity recognition, and it outperforms prior Persian NLP models and multilingual alternatives on these tasks.

How It Works

ParsBERT is built upon Google's BERT architecture and pre-trained on a large Persian corpus of more than 3.9 million documents. The training process involved extensive pre-processing, including POS tagging and WordPiece segmentation, to handle the nuances of the Persian language, particularly the zero-width non-joiner (ZWNJ) character. This corpus and pre-processing pipeline give the model robust performance across a range of downstream NLP tasks.
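
To make the ZWNJ handling concrete, here is a minimal tokenization sketch, assuming the v3.0 checkpoint hosted on the Hugging Face Hub (the example word is illustrative, not taken from the project's docs):

    from transformers import AutoTokenizer

    # Load the ZWNJ-aware v3.0 tokenizer from the Hugging Face Hub
    tokenizer = AutoTokenizer.from_pretrained("HooshvareLab/bert-fa-zwnj-base")

    # "کتاب‌ها" ("books") contains a ZWNJ (U+200C) between "کتاب" and "ها";
    # the word is an illustrative example, not from the project's docs
    print(tokenizer.tokenize("کتاب‌ها"))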

Quick Start & Requirements

  • Install via pip: pip install transformers (requires Python, plus PyTorch or TensorFlow as a backend)
  • Load with AutoTokenizer and AutoModel, as shown in the sketch below
  • Model name: "HooshvareLab/bert-fa-zwnj-base" (for v3.0)
  • Official Hugging Face models: HooshvareLab
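
A minimal loading sketch, assuming PyTorch as the backend (the input sentence is illustrative):

    import torch
    from transformers import AutoTokenizer, AutoModel

    model_name = "HooshvareLab/bert-fa-zwnj-base"  # v3.0, ZWNJ-aware
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)

    # Encode an illustrative Persian sentence ("The Persian language is beautiful")
    inputs = tokenizer("زبان فارسی زیباست", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Token-level contextual embeddings: (batch_size, sequence_length, hidden_size)
    print(outputs.last_hidden_state.shape)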

Highlighted Details

  • State-of-the-art performance on Persian Sentiment Analysis, Text Classification, and Named Entity Recognition benchmarks.
  • V3.0 model specifically addresses the zero-width non-joiner (ZWNJ) character crucial for Persian text.
  • Offers derivative models including DistilBERT, ALBERT, and RoBERTa variants for Persian.
  • Provides fine-tuned models for specific tasks, such as sentiment analysis on Digikala and SnappFood user comments (see the sketch below).
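
As a sketch of how one of the fine-tuned sentiment models might be called via the Transformers pipeline API (the model ID below is an assumption; browse the HooshvareLab Hub page for the actual checkpoint names):

    from transformers import pipeline

    # The model ID is assumed for illustration; check huggingface.co/HooshvareLab
    # for the actual fine-tuned sentiment checkpoints (Digikala, SnappFood)
    classifier = pipeline(
        "sentiment-analysis",
        model="HooshvareLab/bert-fa-base-uncased-sentiment-digikala",
    )

    # Classify an illustrative Persian product review ("The product quality was great")
    print(classifier("کیفیت محصول عالی بود"))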

Maintenance & Community

  • Developed by the Hooshvare Research Group; listed contributors include Mehrdad Farahani, Mohammad Gharachorloo, Marzieh Farahani, and Mohammad Manthouri.
  • Most recent release is v3.0, published in 2021.
  • Paper available for citation: DOI: 10.1007/s11063-021-10528-4.

Licensing & Compatibility

  • License: Apache License 2.0.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The project's latest release (v3.0) dates to 2021, so newer research and developments may not be incorporated. While benchmarks are provided, hardware requirements for fine-tuning or running the larger models are not documented.

Health Check

  • Last Commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 3 stars in the last 30 days

Explore Similar Projects

Starred by Andrew Kane (author of pgvector), Stas Bekman (author of "Machine Learning Engineering Open Book"; research engineer at Snowflake), and 11 more.

xlnet by zihangdai

0%
6k
Language model research paper using generalized autoregressive pretraining
Created 6 years ago
Updated 2 years ago