parsbert by hooshvare

Persian language model based on Google's BERT architecture

Created 5 years ago
397 stars

Top 72.6% on SourcePulse

View on GitHub
Project Summary

ParsBERT is a Transformer-based language model designed specifically for Persian natural language understanding. It offers a pre-trained base model along with fine-tuned variants for sentiment analysis, text classification, and named entity recognition, and it outperforms prior Persian NLP models and multilingual alternatives on these tasks.

How It Works

ParsBERT is built upon Google's BERT architecture and pre-trained on a large Persian corpus of more than 3.9 million documents. The training process involved extensive pre-processing, including POS tagging and WordPiece segmentation, to handle the nuances of the Persian language, particularly the zero-width non-joiner (ZWNJ) character. This corpus and pre-processing pipeline give the model robust performance across a range of downstream NLP tasks.
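
To make the ZWNJ handling concrete, here is a minimal tokenization sketch, assuming the v3.0 checkpoint hosted on the Hugging Face Hub (the example word is illustrative, not taken from the project's docs):

    from transformers import AutoTokenizer

    # Load the ZWNJ-aware v3.0 tokenizer from the Hugging Face Hub
    tokenizer = AutoTokenizer.from_pretrained("HooshvareLab/bert-fa-zwnj-base")

    # "کتاب‌ها" ("books") contains a ZWNJ (U+200C) between "کتاب" and "ها";
    # the word is an illustrative example, not from the project's docs
    print(tokenizer.tokenize("کتاب‌ها"))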

Quick Start & Requirements

  • Install via pip: pip install transformers (requires Python, plus PyTorch or TensorFlow as a backend)
  • Load with AutoTokenizer and AutoModel, as shown in the sketch below
  • Model name: "HooshvareLab/bert-fa-zwnj-base" (for v3.0)
  • Official Hugging Face models: HooshvareLab
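
A minimal loading sketch, assuming PyTorch as the backend (the input sentence is illustrative):

    import torch
    from transformers import AutoTokenizer, AutoModel

    model_name = "HooshvareLab/bert-fa-zwnj-base"  # v3.0, ZWNJ-aware
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)

    # Encode an illustrative Persian sentence ("The Persian language is beautiful")
    inputs = tokenizer("زبان فارسی زیباست", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Token-level contextual embeddings: (batch_size, sequence_length, hidden_size)
    print(outputs.last_hidden_state.shape)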

Highlighted Details

  • State-of-the-art performance on Persian Sentiment Analysis, Text Classification, and Named Entity Recognition benchmarks.
  • V3.0 model specifically addresses the zero-width non-joiner (ZWNJ) character crucial for Persian text.
  • Offers derivative models including DistilBERT, ALBERT, and RoBERTa variants for Persian.
  • Provides fine-tuned models for specific tasks, such as sentiment analysis on Digikala and SnappFood user comments (see the sketch below).
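
As a sketch of how one of the fine-tuned sentiment models might be called via the Transformers pipeline API (the model ID below is an assumption; browse the HooshvareLab Hub page for the actual checkpoint names):

    from transformers import pipeline

    # The model ID is assumed for illustration; check huggingface.co/HooshvareLab
    # for the actual fine-tuned sentiment checkpoints (Digikala, SnappFood)
    classifier = pipeline(
        "sentiment-analysis",
        model="HooshvareLab/bert-fa-base-uncased-sentiment-digikala",
    )

    # Classify an illustrative Persian product review ("The product quality was great")
    print(classifier("کیفیت محصول عالی بود"))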

Maintenance & Community

  • Developed by the Hooshvare Research Group; listed contributors include Mehrdad Farahani, Mohammad Gharachorloo, Marzieh Farahani, and Mohammad Manthouri.
  • Most recent release is v3.0, published in 2021.
  • Paper available for citation: DOI: 10.1007/s11063-021-10528-4.

Licensing & Compatibility

  • License: Apache License 2.0.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The project's latest release (v3.0) dates to 2021, so newer research and developments may not be incorporated. While benchmarks are provided, hardware requirements for fine-tuning or running the larger models are not documented.

Health Check

  • Last Commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 3 stars in the last 30 days

Explore Similar Projects

Starred by Andrew Kane (author of pgvector), Stas Bekman (author of "Machine Learning Engineering Open Book"; research engineer at Snowflake), and 11 more.

xlnet by zihangdai

0%
6k
Language model research paper using generalized autoregressive pretraining
Created 6 years ago
Updated 2 years ago