lm-spanish  by PlanTL-GOB-ES

Spanish language models and resources

created 4 years ago
259 stars

Top 98.4% on sourcepulse

Project Summary

This repository provides official Spanish language models and resources developed by BSC-TEMU as part of the Plan-TL initiative. It offers pre-trained transformer models (RoBERTa, Longformer, GPT-2) and a 7B-parameter LLM (Ǎguila-7B) trained on large Spanish corpora, targeting researchers and developers who need high-quality Spanish NLP capabilities.

How It Works

The project leverages transformer architectures, including RoBERTa and GPT-2, pre-trained on a 570 GB deduplicated corpus of Spanish text from the National Library of Spain (2009-2019). This scale enables robust language understanding and generation. The Ǎguila-7B model builds on Falcon-7B, incorporating Spanish, Catalan, and English data for broader applicability.
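Deduplication at corpus scale is commonly done by hashing normalized text units and discarding repeats. A minimal sketch of that idea follows; it illustrates exact-match deduplication only, and is not the project's actual pipeline, whose details the README does not describe:

```python
import hashlib

def dedupe(documents):
    """Keep only the first occurrence of each document, compared
    after lowercasing and whitespace normalization."""
    seen = set()
    unique = []
    for doc in documents:
        normalized = " ".join(doc.lower().split())
        key = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

corpus = ["Hola  mundo", "hola mundo", "Adiós"]
print(dedupe(corpus))  # the two near-identical greetings collapse to one
```

Real pretraining corpora typically add near-duplicate detection (e.g. shingling or MinHash) on top of exact hashing, since web crawls contain many lightly edited copies of the same page.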

Quick Start & Requirements

Models can be loaded using Hugging Face's transformers library. For example, to use roberta-base-bne:

from transformers import AutoModelForMaskedLM, AutoTokenizer, FillMaskPipeline

# Load the tokenizer and masked-language model from the Hugging Face Hub
tokenizer_hf = AutoTokenizer.from_pretrained('PlanTL-GOB-ES/roberta-base-bne')
model = AutoModelForMaskedLM.from_pretrained('PlanTL-GOB-ES/roberta-base-bne')

# Build a fill-mask pipeline and query it with a masked sentence
pipeline = FillMaskPipeline(model=model, tokenizer=tokenizer_hf)
results = pipeline('Me llamo Francisco Javier y vivo en <mask>.')

Requires the transformers library plus PyTorch or TensorFlow as a backend.
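The fill-mask pipeline returns a list of candidate completions, each a dict with `score`, `token`, `token_str`, and `sequence` keys (the standard transformers fill-mask output format). A small sketch of post-processing such a result, using a hard-coded illustrative output rather than a live model call:

```python
# Hypothetical pipeline output for a masked Spanish sentence; the
# scores and tokens here are illustrative, not real model output.
results = [
    {"score": 0.62, "token_str": "Madrid",    "sequence": "Vivo en Madrid."},
    {"score": 0.21, "token_str": "Barcelona", "sequence": "Vivo en Barcelona."},
    {"score": 0.05, "token_str": "Sevilla",   "sequence": "Vivo en Sevilla."},
]

def best_fill(candidates):
    """Return the completed sentence with the highest model score."""
    return max(candidates, key=lambda c: c["score"])["sequence"]

print(best_fill(results))  # → Vivo en Madrid.
```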

Highlighted Details

  • Offers RoBERTa and GPT-2 models fine-tuned for specific tasks like Named Entity Recognition (NER) and Part-of-Speech (POS) tagging.
  • Includes word embeddings (CBOW, Skip-Gram) trained with Floret and FastText on general and domain-specific corpora.
  • Provides spaCy models for BioNER tasks (tumour morphology, substances/compounds/proteins).
  • Features the EvalES benchmark for evaluating Spanish NLP systems across 10 diverse tasks.
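Once loaded, CBOW or skip-gram vectors like those above are typically compared with cosine similarity. A self-contained sketch, with toy 3-dimensional vectors standing in for real high-dimensional embeddings and hypothetical example words:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy embeddings: semantically close words get similar vectors.
vectors = {
    "perro": [0.9, 0.1, 0.0],
    "gato":  [0.8, 0.2, 0.1],
    "coche": [0.0, 0.1, 0.9],
}
print(cosine(vectors["perro"], vectors["gato"]))   # high (~0.98)
print(cosine(vectors["perro"], vectors["coche"]))  # low  (~0.01)
```

With real Floret/FastText files, the same comparison would be done on vectors loaded from the distributed embedding binaries rather than toy data.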

Maintenance & Community

The models were developed within the MarIA project and are associated with Plan-TL (Plan de las Tecnologías del Lenguaje, Spain's Language Technologies Plan). Contact: plantl-gob-es@bsc.es.

Licensing & Compatibility

Models are generally available for third-party use. The README does not state an explicit license for the models themselves, but the repository is released under the Apache 2.0 license.

Limitations & Caveats

The disclaimer warns that models may contain biases or undesirable distortions, and users are responsible for mitigating risks and complying with AI regulations.

Health Check
Last commit

2 years ago

Responsiveness

1 day

Pull Requests (30d)
0
Issues (30d)
0
Star History
1 star in the last 90 days
