language-models by piegu

Language models and NLP models, pre-trained and fine-tuned

Created 5 years ago · 306 stars · Top 88.6% on sourcepulse

Project Summary

This repository offers a collection of pre-trained language models and Natural Language Processing (NLP) tools, primarily focused on Portuguese and French. It provides resources for developers and researchers to leverage advanced NLP capabilities, including LLM interaction, document understanding, speech-to-text, and sentiment analysis, with a strong emphasis on practical applications and fine-tuning.

How It Works

The project showcases various NLP tasks implemented using Hugging Face libraries and pre-trained models. It features fine-tuning scripts for models like BERT and T5 on specific datasets (e.g., SQuAD for QA, LeNER-Br for NER) and demonstrates techniques for accelerating inference. The approach emphasizes practical application through notebooks and web apps, enabling users to replicate or adapt these NLP solutions.
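For extractive QA on SQuAD-style data, the key preprocessing step is mapping an answer's character offsets in the context to token start/end positions. A minimal pure-Python sketch of that step, using whitespace tokenization as a stand-in for a real Hugging Face tokenizer (the function name and tokenization are illustrative, not taken from the repository):

```python
def char_span_to_token_span(context, answer_start, answer_text):
    """Map a character-level answer span to token indices.

    Uses naive whitespace tokenization; a real setup would use a
    Hugging Face fast tokenizer's offset mapping instead.
    """
    answer_end = answer_start + len(answer_text)

    # Build (start, end) character offsets for each token.
    tokens, offsets, pos = [], [], 0
    for tok in context.split():
        start = context.index(tok, pos)
        end = start + len(tok)
        tokens.append(tok)
        offsets.append((start, end))
        pos = end

    # First token whose end passes answer_start, last token whose
    # start precedes answer_end.
    start_tok = end_tok = None
    for i, (s, e) in enumerate(offsets):
        if start_tok is None and e > answer_start:
            start_tok = i
        if s < answer_end:
            end_tok = i
    return tokens, start_tok, end_tok


context = "Camoes wrote Os Lusiadas in the 16th century"
tokens, start, end = char_span_to_token_span(
    context, context.index("Os Lusiadas"), "Os Lusiadas"
)
print(tokens[start:end + 1])  # → ['Os', 'Lusiadas']
```

In practice the notebooks rely on the tokenizer's own offset mapping, which also handles subword splits and truncated contexts.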

Quick Start & Requirements

  • Installation: There is no packaged install; dependencies are installed per notebook (a pip install -r requirements.txt style setup is implied rather than documented).
  • Prerequisites: Python, Hugging Face libraries (transformers, datasets, accelerate), PyTorch/TensorFlow, and potentially CUDA for GPU acceleration. Specific notebooks may require additional libraries such as unstructured, faster-whisper, or NeMo.
  • Resources: GPU recommended for training/fine-tuning and faster inference. Some models are available on Hugging Face Hub.
  • Links: Numerous notebooks (nbviewer links provided) and blog posts detailing specific implementations.
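Because dependencies vary from notebook to notebook, it can help to probe the import machinery before running one. A small stdlib-only sketch (the package list is illustrative, not an official requirements file):

```python
import importlib.util

# Core packages most notebooks assume; extend per notebook
# (e.g. "unstructured", "faster_whisper" for the audio examples).
required = ["transformers", "datasets", "accelerate", "torch"]
missing = [pkg for pkg in required if importlib.util.find_spec(pkg) is None]

if missing:
    print("Install before running:", ", ".join(missing))
else:
    print("All core dependencies present.")
```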

Highlighted Details

  • CLI tool HF-LLM.rs for interacting with various LLMs (Llama 3.1, Mistral, Gemma 2).
  • unstructured library for PDF to JSON/HTML conversion, including tables.
  • Speech-to-Text with speaker diarization using Whisper and NeMo.
  • Document AI capabilities with LiLT and LayoutXLM models for layout analysis.
  • Fine-tuning examples for BERT and T5 models in Portuguese and French for QA, NER, and text classification.
  • Techniques for accelerating Transformer inference on CPU/GPU.
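In the Whisper + NeMo pipeline above, the glue step is assigning each transcript segment to the speaker whose diarization turn overlaps it most. A hypothetical pure-Python sketch of that alignment (the segment and turn tuple formats are assumptions, not the repository's actual data structures):

```python
def assign_speakers(segments, turns):
    """Label each transcript segment with the speaker whose
    diarization turn has the largest time overlap with it.

    segments: list of (start_sec, end_sec, text) from the ASR model
    turns:    list of (start_sec, end_sec, speaker) from diarization
    """
    labeled = []
    for seg_start, seg_end, text in segments:
        best_speaker, best_overlap = "unknown", 0.0
        for turn_start, turn_end, speaker in turns:
            # Length of the intersection of the two time intervals.
            overlap = min(seg_end, turn_end) - max(seg_start, turn_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labeled.append((best_speaker, text))
    return labeled


segments = [(0.0, 2.5, "Bonjour a tous"), (2.6, 5.0, "Ola, tudo bem?")]
turns = [(0.0, 2.4, "SPEAKER_00"), (2.4, 5.2, "SPEAKER_01")]
print(assign_speakers(segments, turns))
# → [('SPEAKER_00', 'Bonjour a tous'), ('SPEAKER_01', 'Ola, tudo bem?')]
```

Maximum-overlap assignment is a common heuristic for this merge; the actual notebooks may align at the word level instead.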

Maintenance & Community

The repository appears to be a personal collection of projects and tutorials, with a focus on practical NLP applications. No specific community channels (Discord/Slack) or active development team are explicitly mentioned.

Licensing & Compatibility

The repository does not explicitly state a license. The code examples and models referenced are typically under permissive licenses (e.g., MIT, Apache 2.0) from Hugging Face, but users should verify individual component licenses.

Limitations & Caveats

The repository is a collection of notebooks and blog posts, not a unified library. Some notebooks may require significant setup or specific versions of dependencies. Training times for custom models can be substantial, and performance claims are tied to specific hardware and configurations.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 4 stars in the last 90 days

