Spanish language models and resources
This repository provides official Spanish language models and resources developed by BSC-TeMU (the Barcelona Supercomputing Center's Text Mining Unit) as part of the Plan-TL initiative. It offers pre-trained transformer models (RoBERTa, Longformer, GPT-2) and a 7B-parameter LLM (Ǎguila-7B) trained on extensive Spanish corpora, targeting researchers and developers who need high-quality Spanish NLP capabilities.
How It Works
The project leverages transformer architectures, including RoBERTa and GPT-2, pre-trained on a massive 570GB corpus of Spanish text crawled by the National Library of Spain between 2009 and 2019. This extensive, deduplicated dataset enables robust language understanding and generation. The Ǎguila-7B model builds upon Falcon-7B, incorporating Spanish, Catalan, and English data for broader applicability.
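As an illustration, the generative models can be driven like any Hugging Face causal language model. The sketch below is a minimal example, assuming the Ǎguila-7B checkpoint is published under the Hub id 'projecte-aina/aguila-7b' (that id, the prompt, and the generation settings are assumptions, not stated in this README):
from transformers import AutoModelForCausalLM, AutoTokenizer
# Assumed Hub id for Aguila-7B; adjust to the actual checkpoint name.
model_id = 'projecte-aina/aguila-7b'
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Falcon-based checkpoints may additionally need trust_remote_code=True on older transformers versions.
model = AutoModelForCausalLM.from_pretrained(model_id)
# Generate a short Spanish continuation from a prompt.
inputs = tokenizer('El Plan de Tecnologías del Lenguaje tiene como objetivo', return_tensors='pt')
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))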
Quick Start & Requirements
Models can be loaded using Hugging Face's transformers library. For example, to use roberta-base-bne:
from transformers import AutoModelForMaskedLM, AutoTokenizer, FillMaskPipeline
# Load the tokenizer and masked-language model from the Hugging Face Hub
tokenizer_hf = AutoTokenizer.from_pretrained('PlanTL-GOB-ES/roberta-base-bne')
model = AutoModelForMaskedLM.from_pretrained('PlanTL-GOB-ES/roberta-base-bne')
# Wrap them in a fill-mask pipeline for masked-token prediction
pipeline = FillMaskPipeline(model, tokenizer_hf)
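A quick sanity check of the pipeline might look like the following; the example sentence is an illustrative assumption, not taken from the repository:
# Predict the masked token; RoBERTa-style models use the <mask> placeholder
results = pipeline('Me llamo <mask> y vivo en Madrid.')
for r in results:
    # Each candidate carries the predicted token and a confidence score
    print(r['token_str'], round(r['score'], 3))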
Requires the transformers library and either PyTorch or TensorFlow.
Highlighted Details
Maintenance & Community
The models are part of the MarIA project and are associated with Plan-TL (Plan de las Tecnologías del Lenguaje). Contact: plantl-gob-es@bsc.es.
Licensing & Compatibility
Models are generally available for third-party use. The README does not explicitly state a per-model license, but the repository itself is distributed under the Apache 2.0 license.
Limitations & Caveats
The disclaimer warns that models may contain biases or undesirable distortions, and users are responsible for mitigating risks and complying with AI regulations.