ParsBERT by Hooshvare

Persian language model based on Google's BERT architecture

created 5 years ago
383 stars

Top 75.8% on sourcepulse

Project Summary

ParsBERT is a Transformer-based language model specifically designed for Persian natural language understanding tasks. It offers pre-trained models for sentiment analysis, text classification, and named entity recognition, outperforming existing Persian NLP models and multilingual alternatives.

How It Works

ParsBERT is built upon Google's BERT architecture and pre-trained on a massive Persian corpus exceeding 3.9 million documents. The training process involved extensive pre-processing, including POS tagging and WordPiece segmentation, to handle the nuances of the Persian language, particularly the zero-width non-joiner (ZWNJ) character. This approach ensures robust performance across various downstream NLP tasks.
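The ZWNJ handling mentioned above can be illustrated in plain Python. U+200C (zero-width non-joiner) joins the parts of many Persian compound and suffixed words without inserting a visible space; the word form below is a standard textbook example, not drawn from the ParsBERT corpus:

```python
import unicodedata

# U+200C ZERO WIDTH NON-JOINER: in Persian it separates word parts
# visually without acting as whitespace, e.g. before plural suffixes.
ZWNJ = "\u200c"

# "کتاب‌ها" (ketab-ha, "books") = "کتاب" (book) + ZWNJ + "ها" (plural suffix)
word = "کتاب" + ZWNJ + "ها"

print(unicodedata.name(ZWNJ))   # ZERO WIDTH NON-JOINER
print(ZWNJ in word)             # True

# ZWNJ is not whitespace, so a naive whitespace tokenizer sees a
# single token; pre-processing must preserve it rather than strip it
# or replace it with a space.
print(len(word.split()))        # 1
```

This is why a pre-processing pipeline that normalizes Unicode too aggressively (e.g. dropping all format characters) would corrupt Persian word boundaries.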

Quick Start & Requirements

  • Install the Hugging Face transformers library (pip install transformers), then load the model via from transformers import AutoTokenizer, AutoModel
  • Model name: "HooshvareLab/bert-fa-zwnj-base" (for v3.0)
  • Requires Python and the transformers library.
  • Official Hugging Face models: HooshvareLab
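The bullets above amount to a few lines of code. A minimal loading sketch using the v3.0 checkpoint named above (the first call downloads the weights, so network access is required; PyTorch is assumed as the backend):

```python
from transformers import AutoModel, AutoTokenizer

# ParsBERT v3.0 checkpoint on the Hugging Face Hub.
model_name = "HooshvareLab/bert-fa-zwnj-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Tokenize a short Persian sentence ("hello world") and encode it.
inputs = tokenizer("سلام دنیا", return_tensors="pt")
outputs = model(**inputs)

# BERT-base hidden size is 768.
print(outputs.last_hidden_state.shape)  # torch.Size([1, seq_len, 768])
```

The last_hidden_state tensor provides contextual token embeddings that can feed a downstream classifier head for tasks like sentiment analysis or NER.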

Highlighted Details

  • State-of-the-art performance on Persian Sentiment Analysis, Text Classification, and Named Entity Recognition benchmarks.
  • V3.0 model specifically addresses the zero-width non-joiner (ZWNJ) character crucial for Persian text.
  • Offers derivative models including DistilBERT, ALBERT, and RoBERTa variants for Persian.
  • Provides fine-tuned models for specific tasks like sentiment analysis on Digikala and SnappFood user comments.
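The fine-tuned checkpoints in the last bullet can be used through the transformers pipeline API. A sketch for the Digikala sentiment model follows; the model id is an assumption based on the HooshvareLab naming scheme on the Hugging Face Hub and should be verified there, and downloading it requires network access:

```python
from transformers import pipeline

# Assumed checkpoint id for the Digikala sentiment model; verify the
# exact name on the HooshvareLab page of the Hugging Face Hub.
sentiment = pipeline(
    "sentiment-analysis",
    model="HooshvareLab/bert-fa-base-uncased-sentiment-digikala",
)

# "این محصول عالی است" ≈ "This product is great."
result = sentiment("این محصول عالی است")
print(result)  # a list of {'label': ..., 'score': ...} dicts
```

The same pattern applies to the SnappFood sentiment checkpoint by swapping the model id.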

Maintenance & Community

  • Developed by the Hooshvare Research Group, with active contributors listed (Mehrdad Farahani, Mohammad Gharachorloo, Marzieh Farahani, Mohammad Manthouri).
  • Latest release: v3.0, published in 2021.
  • Paper available for citation: DOI: 10.1007/s11063-021-10528-4.

Licensing & Compatibility

  • License: Apache License 2.0.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The latest release (v3.0) dates from 2021, so more recent Persian NLP research may not be incorporated. While benchmarks are provided, hardware requirements for fine-tuning or running the larger models are not documented.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 16 stars in the last 90 days

Starred by Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), Travis Fischer (Founder of Agentic), and 5 more.
