lm-spanish by PlanTL-GOB-ES

Spanish language models and resources

Created 4 years ago
259 stars

Top 97.9% on SourcePulse

Project Summary

This repository provides official Spanish language models and resources developed by the Text Mining Unit (TeMU) of the Barcelona Supercomputing Center (BSC) as part of the Plan-TL initiative. It offers pre-trained transformer models (RoBERTa, Longformer, GPT-2) and a 7B-parameter LLM (Ǎguila-7B) trained on extensive Spanish corpora, targeting researchers and developers who need high-quality Spanish NLP capabilities.

How It Works

The project leverages transformer architectures, including RoBERTa and GPT-2, pre-trained on a massive 570GB corpus of Spanish text crawled by the National Library of Spain (2009-2019). This extensive, deduplicated dataset enables robust language understanding and generation. The Ǎguila-7B model builds upon Falcon-7B, incorporating Spanish, Catalan, and English data for broader applicability.

Quick Start & Requirements

Models can be loaded using Hugging Face's transformers library. For example, to use roberta-base-bne:

from transformers import AutoModelForMaskedLM, AutoTokenizer, FillMaskPipeline

# Download the tokenizer and masked-LM weights from the Hugging Face Hub
tokenizer_hf = AutoTokenizer.from_pretrained('PlanTL-GOB-ES/roberta-base-bne')
model = AutoModelForMaskedLM.from_pretrained('PlanTL-GOB-ES/roberta-base-bne')

# Build a fill-mask pipeline from the model and tokenizer
pipeline = FillMaskPipeline(model, tokenizer_hf)

Requires the transformers library and a PyTorch or TensorFlow backend.
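To show the pipeline end to end, here is a minimal sketch of querying the masked-language model, repeating the loading step for completeness. It assumes the transformers library is installed and the model can be downloaded from the Hugging Face Hub; the example sentence is illustrative, not from the README.

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, FillMaskPipeline

# Load the pre-trained Spanish RoBERTa model (downloads on first use)
tokenizer = AutoTokenizer.from_pretrained('PlanTL-GOB-ES/roberta-base-bne')
model = AutoModelForMaskedLM.from_pretrained('PlanTL-GOB-ES/roberta-base-bne')
pipeline = FillMaskPipeline(model, tokenizer)

# RoBERTa models use '<mask>' as the mask token
predictions = pipeline("La capital de España es <mask>.")

for p in predictions:
    # Each prediction is a dict with 'score', 'token_str', and 'sequence'
    print(f"{p['token_str']}: {p['score']:.3f}")
```

The pipeline returns the top candidates for the masked position ranked by score, so the first entry is the model's best guess.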

Highlighted Details

  • Offers RoBERTa and GPT-2 models fine-tuned for specific tasks like Named Entity Recognition (NER) and Part-of-Speech (POS) tagging.
  • Includes word embeddings (CBOW, Skip-Gram) trained with Floret and FastText on general and domain-specific corpora.
  • Provides spaCy models for BioNER tasks (tumour morphology, substances/compounds/proteins).
  • Features the EvalES benchmark for evaluating Spanish NLP systems across 10 diverse tasks.
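As a sketch of using one of the fine-tuned task models listed above, the snippet below runs the NER model through the generic transformers pipeline API. The model ID 'PlanTL-GOB-ES/roberta-base-bne-capitel-ner' is an assumption based on the project's Hugging Face naming scheme, and the example sentence is illustrative.

```python
from transformers import pipeline

# Assumed Hub ID for the RoBERTa model fine-tuned on CAPITEL NER data;
# aggregation_strategy="simple" merges sub-word tokens into whole entities
ner = pipeline(
    "token-classification",
    model="PlanTL-GOB-ES/roberta-base-bne-capitel-ner",
    aggregation_strategy="simple",
)

entities = ner("Miguel de Cervantes nació en Alcalá de Henares.")

for e in entities:
    # Each entity carries 'entity_group', 'word', 'score', 'start', 'end'
    print(e["entity_group"], e["word"])
```

The POS-tagging checkpoints can be loaded the same way by swapping in the corresponding model ID.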

Maintenance & Community

The project is part of the MarIA project and is associated with Plan-TL (Plan de las Tecnologías del Lenguaje). Contact: plantl-gob-es@bsc.es.

Licensing & Compatibility

Models are generally available for third-party use. The README does not explicitly state a separate license for the models themselves, but the repository is released under the Apache 2.0 license.

Limitations & Caveats

The disclaimer warns that models may contain biases or undesirable distortions, and users are responsible for mitigating risks and complying with AI regulations.

Health Check

Last Commit: 2 years ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 0
Star History: 0 stars in the last 30 days

Explore Similar Projects

Starred by Luis Capelo (Cofounder of Lightning AI), Eugene Yan (AI Scientist at AWS), and 14 more.

text by pytorch

0.0% · 4k stars
PyTorch library for NLP tasks
Created 8 years ago · Updated 1 week ago
Starred by Andrew Kane (Author of pgvector), Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), and 11 more.

xlnet by zihangdai

0.0% · 6k stars
Language model research paper using generalized autoregressive pretraining
Created 6 years ago · Updated 2 years ago
Starred by Boris Cherny (Creator of Claude Code; MTS at Anthropic), Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), and 18 more.

lectures by oxford-cs-deepnlp-2017

0.0% · 16k stars
NLP course (lecture slides) for deep learning approaches to language
Created 8 years ago · Updated 2 years ago