LLaMa2lang by AI-Commandos

CLI tool for fine-tuning LLaMa3 for non-English chat

Created 1 year ago · 310 stars · Top 87.8% on sourcepulse

Project Summary

This repository provides scripts to fine-tune LLaMa3 and other foundation models for non-English languages. It addresses the limitation of base models performing poorly in languages other than English by translating datasets and fine-tuning using QLoRA and PEFT. The target audience includes researchers and developers aiming to create multilingual LLMs.

How It Works

The process involves translating a base dataset (such as OASST1) into a target language using one of several translation models (OPUS, M2M100, MADLAD, NLLB, etc.). The translated conversations are then structured into prompt threads. Finally, QLoRA and PEFT are used to fine-tune a foundation model on the translated data, with optional Direct Preference Optimization (DPO) or Odds Ratio Preference Optimization (ORPO) for further refinement. This approach leverages parameter-efficient fine-tuning to adapt powerful base models to specific languages.
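The thread-structuring step can be sketched in plain Python. The record fields (`message_id`, `parent_id`, `role`, `text`) mirror OASST1's conversation-tree schema, but `build_threads` itself is a hypothetical helper for illustration, not LLaMa2lang's actual code.

```python
# Illustrative sketch: fold OASST1-style records into linear prompt threads,
# one root-to-leaf path per conversation branch. Field names follow OASST1's
# schema; build_threads is hypothetical, not the repo's implementation.

def build_threads(records):
    """Return one (role, text) thread per root-to-leaf path in the tree."""
    children = {}
    for r in records:
        if r["parent_id"] is not None:
            children.setdefault(r["parent_id"], []).append(r)

    def walk(node, path):
        path = path + [(node["role"], node["text"])]
        kids = children.get(node["message_id"], [])
        if not kids:          # leaf: the path is a complete thread
            yield path
        for kid in kids:
            yield from walk(kid, path)

    threads = []
    for r in records:
        if r["parent_id"] is None:  # roots start a conversation
            threads.extend(walk(r, []))
    return threads

records = [
    {"message_id": "1", "parent_id": None, "role": "prompter",
     "text": "Wat is de hoofdstad van Nederland?"},
    {"message_id": "2", "parent_id": "1", "role": "assistant",
     "text": "Amsterdam."},
]
print(build_threads(records))
# → [[('prompter', 'Wat is de hoofdstad van Nederland?'), ('assistant', 'Amsterdam.')]]
```

Each resulting thread can then be rendered into the chat template of the target foundation model before fine-tuning.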

Quick Start & Requirements

  • Install dependencies: pip install -r requirements.txt
  • PyTorch with CUDA support is recommended.
  • Translation: python translate.py m2m nl ./output_nl --quant4 --batch_size 20
  • Combine checkpoints: python combine_checkpoints.py ./output_nl ./hf_dataset_nl
  • Fine-tune: python finetune.py tuned_model_nl hf_dataset_nl "You are a generic chatbot that always answers in Dutch."
  • Inference: python run_inference.py tuned_model_nl "You are a generic chatbot that always answers in Dutch." "Wat is de hoofdstad van Nederland?"
  • See official quick-start for detailed commands.
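The `--batch_size 20` flag in the translation step groups texts so the translation model processes several at once. A minimal chunking sketch, assuming a hypothetical `chunked` helper (not repo code):

```python
# Minimal sketch of how a --batch_size flag could group texts for batched
# translation. chunked() is a hypothetical helper, not LLaMa2lang's code.

def chunked(items, batch_size=20):
    """Yield successive batches of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

texts = [f"sentence {i}" for i in range(45)]
print([len(batch) for batch in chunked(texts, 20)])  # → [20, 20, 5]
```

Larger batches improve GPU utilization but raise peak memory, which is why the quick-start pairs a moderate batch size with `--quant4` quantization.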

Highlighted Details

  • Supports LLaMa3, LLaMa2, Mistral, and Mixtral models.
  • Offers multiple translation models including OPUS, M2M, MADLAD, NLLB, and Seamless.
  • Fine-tuning can be done with QLoRA, DPO, and ORPO.
  • Benchmarking script (benchmark.py) available to compare translation model performance.
  • Pre-trained adapters for various languages and models are available on Hugging Face Hub.
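To give intuition for the ORPO option above, the toy function below computes ORPO's odds-ratio preference term on made-up sequence probabilities: `odds(p) = p / (1 - p)`, and the penalty is `-log sigmoid(log(odds(chosen) / odds(rejected)))`. This is a simplified illustration of the loss term only; real ORPO training operates on per-token model likelihoods alongside the usual language-modeling loss.

```python
import math

# Toy illustration of ORPO's odds-ratio preference term. The probabilities
# are made up; actual training uses sequence likelihoods from the model.

def odds(p):
    return p / (1.0 - p)

def orpo_preference_loss(p_chosen, p_rejected):
    """-log sigmoid(log odds ratio); small when the chosen answer is favored."""
    log_odds_ratio = math.log(odds(p_chosen) / odds(p_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-log_odds_ratio)))

# Loss is low when the model already prefers the chosen answer...
print(orpo_preference_loss(0.7, 0.3))
# ...and higher when it prefers the rejected one.
print(orpo_preference_loss(0.3, 0.7))
```

The asymmetry is the point: the term pushes the model toward assigning higher likelihood to preferred (e.g. better-translated) responses without a separate reference model, which is what distinguishes ORPO from DPO.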

Maintenance & Community

  • Actively developed with recent support for LLaMa3.
  • Community contributions are encouraged via pull requests.
  • Contact info@commandos.ai for funding inquiries.

Licensing & Compatibility

  • The repository itself does not specify a license in the README.
  • Base models (LLaMa, Mistral) have their own licenses, which may restrict commercial use.
  • Translated datasets and fine-tuned models are often hosted on Hugging Face Hub under various licenses (e.g., Apache 2.0 for some adapters).

Limitations & Caveats

  • Translation quality can vary significantly between models and languages.
  • The translation step can be time-consuming (e.g., ~36 hours for OASST1 to a single language on a free Colab T4).
  • Fine-tuning performance depends heavily on the quality and quantity of the translated data.
  • Compatibility with other fine-tuning frameworks like Axolotl is mentioned but not detailed.
Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 5 stars in the last 90 days
