language-models by piegu

Language models and NLP models, pre-trained and fine-tuned

Created 5 years ago · 306 stars · Top 88.6% on sourcepulse

Project Summary

This repository offers a collection of pre-trained language models and Natural Language Processing (NLP) tools, primarily focused on Portuguese and French. It provides resources for developers and researchers to leverage advanced NLP capabilities, including LLM interaction, document understanding, speech-to-text, and sentiment analysis, with a strong emphasis on practical applications and fine-tuning.

How It Works

The project showcases various NLP tasks implemented using Hugging Face libraries and pre-trained models. It features fine-tuning scripts for models like BERT and T5 on specific datasets (e.g., SQuAD for QA, LeNER-Br for NER) and demonstrates techniques for accelerating inference. The approach emphasizes practical application through notebooks and web apps, enabling users to replicate or adapt these NLP solutions.
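For extractive QA on SQuAD-style data, the key preprocessing step is mapping an answer's character offsets in the context to token start/end positions. A minimal pure-Python sketch of that step, using whitespace tokenization as a stand-in for a real Hugging Face tokenizer (the function name and tokenization are illustrative, not taken from the repository):

```python
def char_span_to_token_span(context, answer_start, answer_text):
    """Map a character-level answer span to token indices.

    Uses naive whitespace tokenization; a real setup would use a
    Hugging Face fast tokenizer's offset mapping instead.
    """
    answer_end = answer_start + len(answer_text)

    # Build (start, end) character offsets for each token.
    tokens, offsets, pos = [], [], 0
    for tok in context.split():
        start = context.index(tok, pos)
        end = start + len(tok)
        tokens.append(tok)
        offsets.append((start, end))
        pos = end

    # First token whose end passes answer_start, last token whose
    # start precedes answer_end.
    start_tok = end_tok = None
    for i, (s, e) in enumerate(offsets):
        if start_tok is None and e > answer_start:
            start_tok = i
        if s < answer_end:
            end_tok = i
    return tokens, start_tok, end_tok


context = "Camoes wrote Os Lusiadas in the 16th century"
tokens, start, end = char_span_to_token_span(
    context, context.index("Os Lusiadas"), "Os Lusiadas"
)
print(tokens[start:end + 1])  # → ['Os', 'Lusiadas']
```

In practice the notebooks rely on the tokenizer's own offset mapping, which also handles subword splits and truncated contexts.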

Quick Start & Requirements

  • Installation: There is no packaged install; dependencies are installed per notebook (a pip install -r requirements.txt style setup is implied rather than documented).
  • Prerequisites: Python, Hugging Face libraries (transformers, datasets, accelerate), PyTorch/TensorFlow, and potentially CUDA for GPU acceleration. Specific notebooks may require additional libraries such as unstructured, faster-whisper, or NeMo.
  • Resources: GPU recommended for training/fine-tuning and faster inference. Some models are available on Hugging Face Hub.
  • Links: Numerous notebooks (nbviewer links provided) and blog posts detailing specific implementations.
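Because dependencies vary from notebook to notebook, it can help to probe the import machinery before running one. A small stdlib-only sketch (the package list is illustrative, not an official requirements file):

```python
import importlib.util

# Core packages most notebooks assume; extend per notebook
# (e.g. "unstructured", "faster_whisper" for the audio examples).
required = ["transformers", "datasets", "accelerate", "torch"]
missing = [pkg for pkg in required if importlib.util.find_spec(pkg) is None]

if missing:
    print("Install before running:", ", ".join(missing))
else:
    print("All core dependencies present.")
```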

Highlighted Details

  • CLI tool HF-LLM.rs for interacting with various LLMs (Llama 3.1, Mistral, Gemma 2).
  • unstructured library for PDF to JSON/HTML conversion, including tables.
  • Speech-to-Text with speaker diarization using Whisper and NeMo.
  • Document AI capabilities with LiLT and LayoutXLM models for layout analysis.
  • Fine-tuning examples for BERT and T5 models in Portuguese and French for QA, NER, and text classification.
  • Techniques for accelerating Transformer inference on CPU/GPU.
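In the Whisper + NeMo pipeline above, the glue step is assigning each transcript segment to the speaker whose diarization turn overlaps it most. A hypothetical pure-Python sketch of that alignment (the segment and turn tuple formats are assumptions, not the repository's actual data structures):

```python
def assign_speakers(segments, turns):
    """Label each transcript segment with the speaker whose
    diarization turn has the largest time overlap with it.

    segments: list of (start_sec, end_sec, text) from the ASR model
    turns:    list of (start_sec, end_sec, speaker) from diarization
    """
    labeled = []
    for seg_start, seg_end, text in segments:
        best_speaker, best_overlap = "unknown", 0.0
        for turn_start, turn_end, speaker in turns:
            # Length of the intersection of the two time intervals.
            overlap = min(seg_end, turn_end) - max(seg_start, turn_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labeled.append((best_speaker, text))
    return labeled


segments = [(0.0, 2.5, "Bonjour a tous"), (2.6, 5.0, "Ola, tudo bem?")]
turns = [(0.0, 2.4, "SPEAKER_00"), (2.4, 5.2, "SPEAKER_01")]
print(assign_speakers(segments, turns))
# → [('SPEAKER_00', 'Bonjour a tous'), ('SPEAKER_01', 'Ola, tudo bem?')]
```

Maximum-overlap assignment is a common heuristic for this merge; the actual notebooks may align at the word level instead.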

Maintenance & Community

The repository appears to be a personal collection of projects and tutorials, with a focus on practical NLP applications. No specific community channels (Discord/Slack) or active development team are explicitly mentioned.

Licensing & Compatibility

The repository does not explicitly state a license. The code examples and models referenced are typically under permissive licenses (e.g., MIT, Apache 2.0) from Hugging Face, but users should verify individual component licenses.

Limitations & Caveats

The repository is a collection of notebooks and blog posts, not a unified library. Some notebooks may require significant setup or specific versions of dependencies. Training times for custom models can be substantial, and performance claims are tied to specific hardware and configurations.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 4 stars in the last 90 days

