lm-spanish by PlanTL-GOB-ES

Spanish language models and resources

Created 4 years ago
259 stars

Top 97.9% on SourcePulse

Project Summary

This repository provides official Spanish language models and resources developed by the Text Mining Unit (TeMU) of the Barcelona Supercomputing Center (BSC) as part of the Plan-TL initiative. It offers pre-trained transformer models (RoBERTa, Longformer, GPT-2) and a 7B-parameter LLM (Ǎguila-7B) trained on extensive Spanish corpora, targeting researchers and developers who need high-quality Spanish NLP capabilities.

How It Works

The project leverages transformer architectures, including RoBERTa and GPT-2, pre-trained on a massive 570GB corpus of Spanish text crawled by the National Library of Spain (2009-2019). This extensive, deduplicated dataset enables robust language understanding and generation. The Ǎguila-7B model builds upon Falcon-7B, incorporating Spanish, Catalan, and English data for broader applicability.

Quick Start & Requirements

Models can be loaded using Hugging Face's transformers library. For example, to use roberta-base-bne:

from transformers import AutoModelForMaskedLM, AutoTokenizer, FillMaskPipeline

# Download the tokenizer and masked-LM weights from the Hugging Face Hub
tokenizer_hf = AutoTokenizer.from_pretrained('PlanTL-GOB-ES/roberta-base-bne')
model = AutoModelForMaskedLM.from_pretrained('PlanTL-GOB-ES/roberta-base-bne')

# Build a fill-mask pipeline from the model and tokenizer
pipeline = FillMaskPipeline(model, tokenizer_hf)

Requires the transformers library and a PyTorch or TensorFlow backend.
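To show the pipeline end to end, here is a minimal sketch of querying the masked-language model, repeating the loading step for completeness. It assumes the transformers library is installed and the model can be downloaded from the Hugging Face Hub; the example sentence is illustrative, not from the README.

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, FillMaskPipeline

# Load the pre-trained Spanish RoBERTa model (downloads on first use)
tokenizer = AutoTokenizer.from_pretrained('PlanTL-GOB-ES/roberta-base-bne')
model = AutoModelForMaskedLM.from_pretrained('PlanTL-GOB-ES/roberta-base-bne')
pipeline = FillMaskPipeline(model, tokenizer)

# RoBERTa models use '<mask>' as the mask token
predictions = pipeline("La capital de España es <mask>.")

for p in predictions:
    # Each prediction is a dict with 'score', 'token_str', and 'sequence'
    print(f"{p['token_str']}: {p['score']:.3f}")
```

The pipeline returns the top candidates for the masked position ranked by score, so the first entry is the model's best guess.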

Highlighted Details

  • Offers RoBERTa and GPT-2 models fine-tuned for specific tasks like Named Entity Recognition (NER) and Part-of-Speech (POS) tagging.
  • Includes word embeddings (CBOW, Skip-Gram) trained with Floret and FastText on general and domain-specific corpora.
  • Provides spaCy models for BioNER tasks (tumour morphology, substances/compounds/proteins).
  • Features the EvalES benchmark for evaluating Spanish NLP systems across 10 diverse tasks.
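As a sketch of using one of the fine-tuned task models listed above, the snippet below runs the NER model through the generic transformers pipeline API. The model ID 'PlanTL-GOB-ES/roberta-base-bne-capitel-ner' is an assumption based on the project's Hugging Face naming scheme, and the example sentence is illustrative.

```python
from transformers import pipeline

# Assumed Hub ID for the RoBERTa model fine-tuned on CAPITEL NER data;
# aggregation_strategy="simple" merges sub-word tokens into whole entities
ner = pipeline(
    "token-classification",
    model="PlanTL-GOB-ES/roberta-base-bne-capitel-ner",
    aggregation_strategy="simple",
)

entities = ner("Miguel de Cervantes nació en Alcalá de Henares.")

for e in entities:
    # Each entity carries 'entity_group', 'word', 'score', 'start', 'end'
    print(e["entity_group"], e["word"])
```

The POS-tagging checkpoints can be loaded the same way by swapping in the corresponding model ID.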

Maintenance & Community

The project is part of the MarIA project and is associated with Plan-TL (Plan de las Tecnologías del Lenguaje). Contact: plantl-gob-es@bsc.es.

Licensing & Compatibility

Models are generally available for third-party use. The README does not explicitly state a separate license for the models themselves, but the repository is released under the Apache 2.0 license.

Limitations & Caveats

The disclaimer warns that models may contain biases or undesirable distortions, and users are responsible for mitigating risks and complying with AI regulations.

Health Check

Last Commit: 2 years ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 0
Star History: 0 stars in the last 30 days

Explore Similar Projects

Starred by Luis Capelo (Cofounder of Lightning AI), Eugene Yan (AI Scientist at AWS), and 14 more.

text by pytorch

0.0% · 4k stars
PyTorch library for NLP tasks
Created 8 years ago · Updated 1 week ago
Starred by Andrew Kane (Author of pgvector), Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), and 11 more.

xlnet by zihangdai

0.0% · 6k stars
Language model research paper using generalized autoregressive pretraining
Created 6 years ago · Updated 2 years ago
Starred by Boris Cherny (Creator of Claude Code; MTS at Anthropic), Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), and 18 more.

lectures by oxford-cs-deepnlp-2017

0.0% · 16k stars
NLP course (lecture slides) for deep learning approaches to language
Created 8 years ago · Updated 2 years ago