Pre-trained Transformers for Arabic language tasks
This repository provides pre-trained Transformer models for Arabic Natural Language Processing, including AraBERT, AraGPT2, and AraELECTRA. It targets researchers and developers working with Arabic text, offering robust models for understanding and generation tasks, with improved versions and specialized models for dialects and tweets.
How It Works
The project leverages the Transformer architecture, pre-training models from scratch on extensive Arabic datasets. AraBERT variants utilize Masked Language Modeling (MLM), while AraGPT2 focuses on causal language modeling for generation. AraELECTRA employs a discriminator-based pre-training approach for efficient understanding tasks. The models are trained on large corpora, including filtered OSCAR, Wikipedia, and news articles, with improved tokenization and preprocessing for better handling of Arabic nuances like dialects and emojis.
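The masked-language-modeling objective described above can be sketched in a few lines. This is an illustrative toy, not AraBERT's actual training code: a fraction of input tokens is hidden, and the model is trained to recover the original token at each hidden position.

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", mask_prob=0.15, seed=42):
    """Toy sketch of BERT-style Masked Language Modeling (not AraBERT's
    real training pipeline): hide a random fraction of tokens; the model
    must predict the original token at each masked position."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)   # corrupted input the model sees
            labels.append(tok)          # target the model must recover
        else:
            masked.append(tok)
            labels.append(None)         # no loss at unmasked positions
    return masked, labels

tokens = "النص العربي يحتاج إلى معالجة خاصة".split()
masked, labels = mask_tokens(tokens)
print(masked)
```

AraGPT2, by contrast, uses a causal objective (predict the next token from the left context only), and AraELECTRA trains a discriminator to detect which tokens in the input were replaced rather than reconstructing masked ones.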
Quick Start & Requirements
Install the helper package:

pip install arabert

Then import the preprocessor (the model name passed to it selects the matching AraBERT variant; the one below is an assumed example):

from arabert import ArabertPreprocessor
arabert_prep = ArabertPreprocessor(model_name="aubmindlab/bert-base-arabertv2")

All models are published under the aubmindlab organization on the Hugging Face Hub, and example notebooks are available in the repository's examples/ folder.

Highlighted Details
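As an illustration of the kind of normalization step such a preprocessor applies, the sketch below strips the Arabic tatweel (elongation) character and short-vowel diacritics. This is a simplified stand-in, not the actual ArabertPreprocessor logic.

```python
import re

# Simplified sketch of common Arabic text normalization (NOT the real
# ArabertPreprocessor implementation): remove tatweel and diacritics.
TATWEEL = "\u0640"                          # the elongation character "ـ"
DIACRITICS = re.compile(r"[\u064B-\u0652]")  # fathatan..sukun

def normalize(text: str) -> str:
    text = text.replace(TATWEEL, "")   # drop decorative elongation
    text = DIACRITICS.sub("", text)    # drop short-vowel marks
    return text

print(normalize("العـــربيةُ"))  # → العربية
```

Without such normalization, the same word written with or without elongation or diacritics would map to different subword sequences, fragmenting the vocabulary the models were pre-trained on.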
Maintenance & Community
Last activity was about 2 years ago; the project is currently marked inactive.

Licensing & Compatibility

Limitations & Caveats