arabert by aub-mind

Pre-trained Transformers for Arabic language tasks

created 5 years ago
676 stars

Top 51.0% on sourcepulse

Project Summary

This repository provides pre-trained Transformer models for Arabic Natural Language Processing, including AraBERT, AraGPT2, and AraELECTRA. It targets researchers and developers working with Arabic text, offering robust models for understanding and generation tasks, with improved versions and specialized models for dialects and tweets.

How It Works

The project leverages the Transformer architecture, pre-training models from scratch on extensive Arabic datasets. AraBERT variants utilize Masked Language Modeling (MLM), while AraGPT2 focuses on causal language modeling for generation. AraELECTRA employs a discriminator-based pre-training approach for efficient understanding tasks. The models are trained on large corpora, including filtered OSCAR, Wikipedia, and news articles, with improved tokenization and preprocessing for better handling of Arabic nuances like dialects and emojis.
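The Masked Language Modeling objective used by the AraBERT variants can be illustrated with a toy masking routine. This is a sketch of the idea only, not the repository's actual training code (which operates on subword tokens with additional rules such as random-token replacement):

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", mask_prob=0.15, seed=0):
    """Toy illustration of BERT-style Masked Language Modeling:
    replace ~15% of tokens with [MASK]; the model is trained to
    predict the original token at each masked position."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)   # prediction target at this position
        else:
            masked.append(tok)
            labels.append(None)  # no loss computed here
    return masked, labels
```

AraELECTRA's discriminator objective differs: instead of predicting masked tokens, the model classifies each token as original or replaced, which yields a learning signal at every position rather than only the ~15% masked ones.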

Quick Start & Requirements

  • Install via pip: pip install arabert
  • Usage example: from arabert import ArabertPreprocessor
  • Models are available on HuggingFace under aubmindlab.
  • Checkpoints available in PyTorch, TF2, and TF1 formats.
  • Official demo space on HuggingFace: https://huggingface.co/spaces/aubmindlab/Arabic-NLP-Demo
  • Colab notebooks are available in the examples/ folder.
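The ArabertPreprocessor mentioned above normalizes Arabic text before tokenization. As a rough, self-contained illustration of the kind of cleanup involved (not the library's actual implementation, which also handles URLs, emojis, and dialect-specific cases):

```python
import re
import unicodedata

def normalize_arabic(text: str) -> str:
    """Illustrative Arabic cleanup, approximating a small part of what a
    preprocessor like ArabertPreprocessor does: strip tatweel and
    diacritics, and unify hamza-carrying alef variants."""
    text = text.replace("\u0640", "")  # remove tatweel (kashida)
    # drop diacritics: Unicode combining marks (category Mn)
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    # unify alef variants (أ / إ / آ -> ا)
    text = re.sub("[\u0623\u0625\u0622]", "\u0627", text)
    return text

print(normalize_arabic("كِتَابٌ"))  # -> كتاب
```

For real use, prefer the library's own preprocessor (matched to each model's vocabulary) via `ArabertPreprocessor(model_name=...)` as shown in the repository's examples.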

Highlighted Details

  • AraBERTv2 models feature improved preprocessing and a new vocabulary addressing issues with punctuation and numbers.
  • AraBERTv0.2-Twitter models are specifically trained on ~60M Arabic tweets, supporting dialects and emojis.
  • AraGPT2 offers variants from base to mega, suitable for Arabic text generation.
  • AraELECTRA provides discriminator models for Arabic language understanding.
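AraGPT2's causal objective (predict the next token given the preceding ones) can be illustrated with a tiny bigram model. This is a sketch of the objective only; the actual models use a Transformer over subwords, and the English toy tokens are placeholders:

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Toy causal language model: count which token follows which.
    AraGPT2 performs the same next-token prediction, but with a
    Transformer over subword tokens instead of bigram counts."""
    counts = defaultdict(Counter)
    for sent in corpus:
        for prev, nxt in zip(sent, sent[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, token):
    """Return the most frequent successor of `token`, or None."""
    if token not in counts:
        return None
    return counts[token].most_common(1)[0][0]

corpus = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog", "sat"]]
model = train_bigram(corpus)
print(predict_next(model, "the"))  # -> cat
```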

Maintenance & Community

  • The project is maintained by members of the AUB MIND Lab, though the last commit is roughly two years old.
  • Key contributors: Wissam Antoun, Fady Baly, Hazem Hajj.
  • Contact information and social links (LinkedIn, Twitter, GitHub) for contributors are provided.

Licensing & Compatibility

  • The repository does not explicitly state a license. The models are distributed via HuggingFace under the aubmindlab organization; any license terms would appear on the individual model cards rather than being implied by the hosting platform.
  • Commercial use may be possible, but users should verify the license of the specific model weights they intend to use.

Limitations & Caveats

  • The repository does not explicitly state a license, which may cause concern for commercial adoption.
  • Older versions (AraBERTv1) have known issues with wordpiece vocabulary related to punctuation and numbers, though v2 addresses this.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 12 stars in the last 90 days
