arabert by aub-mind

Pre-trained Transformers for Arabic language tasks

created 5 years ago
676 stars

Top 51.0% on sourcepulse

Project Summary

This repository provides pre-trained Transformer models for Arabic Natural Language Processing, including AraBERT, AraGPT2, and AraELECTRA. It targets researchers and developers working with Arabic text, offering robust models for understanding and generation tasks, with improved versions and specialized models for dialects and tweets.

How It Works

The project leverages the Transformer architecture, pre-training models from scratch on extensive Arabic datasets. AraBERT variants utilize Masked Language Modeling (MLM), while AraGPT2 focuses on causal language modeling for generation. AraELECTRA employs a discriminator-based pre-training approach for efficient understanding tasks. The models are trained on large corpora, including filtered OSCAR, Wikipedia, and news articles, with improved tokenization and preprocessing for better handling of Arabic nuances like dialects and emojis.
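The Masked Language Modeling objective used by the AraBERT variants can be illustrated with a toy masking routine. This is a sketch of the idea only, not the repository's actual training code (which operates on subword tokens with additional rules such as random-token replacement):

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", mask_prob=0.15, seed=0):
    """Toy illustration of BERT-style Masked Language Modeling:
    replace ~15% of tokens with [MASK]; the model is trained to
    predict the original token at each masked position."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)   # prediction target at this position
        else:
            masked.append(tok)
            labels.append(None)  # no loss computed here
    return masked, labels
```

AraELECTRA's discriminator objective differs: instead of predicting masked tokens, the model classifies each token as original or replaced, which yields a learning signal at every position rather than only the ~15% masked ones.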

Quick Start & Requirements

  • Install via pip: pip install arabert
  • Usage example: from arabert import ArabertPreprocessor
  • Models are available on HuggingFace under aubmindlab.
  • Checkpoints available in PyTorch, TF2, and TF1 formats.
  • Official demo space on HuggingFace: https://huggingface.co/spaces/aubmindlab/Arabic-NLP-Demo
  • Colab notebooks are available in the examples/ folder.
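The ArabertPreprocessor mentioned above normalizes Arabic text before tokenization. As a rough, self-contained illustration of the kind of cleanup involved (not the library's actual implementation, which also handles URLs, emojis, and dialect-specific cases):

```python
import re
import unicodedata

def normalize_arabic(text: str) -> str:
    """Illustrative Arabic cleanup, approximating a small part of what a
    preprocessor like ArabertPreprocessor does: strip tatweel and
    diacritics, and unify hamza-carrying alef variants."""
    text = text.replace("\u0640", "")  # remove tatweel (kashida)
    # drop diacritics: Unicode combining marks (category Mn)
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    # unify alef variants (أ / إ / آ -> ا)
    text = re.sub("[\u0623\u0625\u0622]", "\u0627", text)
    return text

print(normalize_arabic("كِتَابٌ"))  # -> كتاب
```

For real use, prefer the library's own preprocessor (matched to each model's vocabulary) via `ArabertPreprocessor(model_name=...)` as shown in the repository's examples.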

Highlighted Details

  • AraBERTv2 models feature improved preprocessing and a new vocabulary addressing issues with punctuation and numbers.
  • AraBERTv0.2-Twitter models are specifically trained on ~60M Arabic tweets, supporting dialects and emojis.
  • AraGPT2 offers variants from base to mega, suitable for Arabic text generation.
  • AraELECTRA provides discriminator models for Arabic language understanding.
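AraGPT2's causal objective (predict the next token given the preceding ones) can be illustrated with a tiny bigram model. This is a sketch of the objective only; the actual models use a Transformer over subwords, and the English toy tokens are placeholders:

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Toy causal language model: count which token follows which.
    AraGPT2 performs the same next-token prediction, but with a
    Transformer over subword tokens instead of bigram counts."""
    counts = defaultdict(Counter)
    for sent in corpus:
        for prev, nxt in zip(sent, sent[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, token):
    """Return the most frequent successor of `token`, or None."""
    if token not in counts:
        return None
    return counts[token].most_common(1)[0][0]

corpus = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog", "sat"]]
model = train_bigram(corpus)
print(predict_next(model, "the"))  # -> cat
```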

Maintenance & Community

  • The project is maintained by members of the AUB MIND Lab, though the last commit is roughly two years old.
  • Key contributors: Wissam Antoun, Fady Baly, Hazem Hajj.
  • Contact information and social links (LinkedIn, Twitter, GitHub) for contributors are provided.

Licensing & Compatibility

  • The repository does not explicitly state a license. The models are distributed via HuggingFace under the aubmindlab organization; any license terms would appear on the individual model cards rather than being implied by the hosting platform.
  • Commercial use may be possible, but users should verify the license of the specific model weights they intend to use.

Limitations & Caveats

  • The repository does not explicitly state a license, which may cause concern for commercial adoption.
  • Older versions (AraBERTv1) have known issues with wordpiece vocabulary related to punctuation and numbers, though v2 addresses this.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 12 stars in the last 90 days
