lingua by facebookresearch

LLM research codebase for training and inference

created 9 months ago
4,668 stars

Top 10.8% on sourcepulse

Project Summary

Meta Lingua is a minimal and fast PyTorch library for LLM research, enabling end-to-end training, inference, and evaluation. It targets researchers seeking to experiment with novel architectures, losses, and data, offering a lean, easily modifiable codebase for quick iteration and analysis of speed and stability.

How It Works

Lingua provides a modular PyTorch framework with core components for model architecture, distributed training (Data Parallel, FSDP, model parallelism, torch.compile, activation checkpointing, and float8), data loading, profiling, and checkpoint management. Its distributed.py module is central, abstracting the complex parallelism strategies behind a single parallelize_module function. Configurations are managed via dataclasses and YAML files, allowing flexible parameter tuning.
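
As a rough illustration of this pattern, the sketch below shows how a dataclass-backed config can be overlaid with values from a YAML document. The field names (dp_shard, tp_size, and so on) and the load_config helper are hypothetical, chosen for illustration; lingua's actual dataclasses and the real parallelize_module signature live in the repository.

    # A minimal sketch of the dataclass + YAML configuration pattern.
    # Field names are hypothetical; see lingua's source for the real ones.
    from dataclasses import dataclass, field

    import yaml  # PyYAML

    @dataclass
    class DistributedArgs:
        dp_shard: int = 1      # FSDP sharding degree (assumed name)
        tp_size: int = 1       # model-parallel degree (assumed name)
        compile: bool = False  # wrap the model in torch.compile
        float8: bool = False   # enable float8 matmuls where supported

    @dataclass
    class TrainArgs:
        steps: int = 1000
        distributed: DistributedArgs = field(default_factory=DistributedArgs)

    def load_config(text: str) -> TrainArgs:
        """Overlay a YAML document onto the dataclass defaults."""
        raw = yaml.safe_load(text) or {}
        dist = DistributedArgs(**raw.pop("distributed", {}))
        return TrainArgs(distributed=dist, **raw)

    cfg = load_config("""
    steps: 2000
    distributed:
      dp_shard: 8
      compile: true
    """)
    print(cfg.distributed.dp_shard)  # -> 8

In Lingua itself, arguments like these are what drive distributed.py's parallelize_module, which, per the summary above, applies the chosen parallelism strategy in one place.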

Quick Start & Requirements

  • Install by cloning the repo (git clone https://github.com/facebookresearch/lingua) and creating the environment with bash setup/create_env.sh, or sbatch setup/create_env.sh on a SLURM cluster.
  • Requires Python and PyTorch; SLURM is used for cluster-scale distributed training.
  • Prepare training data with python setup/download_prepare_hf_data.py and download a tokenizer with python setup/download_tokenizer.py.
  • Official docs: https://github.com/facebookresearch/lingua

Highlighted Details

  • Achieves strong performance on downstream tasks, matching the DCLM baseline 1.0 at the 1B-parameter scale.
  • Supports various model architectures including Transformer, minGRU, minLSTM, Hawk, and Mamba.
  • Integrates advanced distributed training features like FSDP and torch.compile for efficiency.
  • Offers detailed profiling tools for MFU, HFU, and memory usage.
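
As a back-of-envelope illustration of what the MFU and HFU numbers measure, the sketch below uses the common approximation of ~6*N training FLOPs per token for an N-parameter model (~8*N when activations are recomputed under checkpointing). The throughput and peak-FLOPs figures are illustrative, not Lingua output.

    # Rough MFU/HFU arithmetic using the standard 6*N FLOPs-per-token
    # training estimate (8*N with full activation recomputation).
    def mfu(n_params: float, tokens_per_sec: float, peak_flops: float) -> float:
        """Model FLOPs utilization: achieved training FLOP/s over peak FLOP/s."""
        return 6 * n_params * tokens_per_sec / peak_flops

    def hfu(n_params: float, tokens_per_sec: float, peak_flops: float) -> float:
        """Hardware FLOPs utilization: also counts the recompute pass."""
        return 8 * n_params * tokens_per_sec / peak_flops

    # Example: a 1B-parameter model at 50k tokens/s on a ~989 TFLOP/s
    # (BF16, H100-class) accelerator -- figures chosen for illustration.
    print(f"MFU: {mfu(1e9, 5e4, 989e12):.1%}")  # ~30.3%
    print(f"HFU: {hfu(1e9, 5e4, 989e12):.1%}")  # ~40.4%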

Maintenance & Community

  • Developed by Meta AI.
  • Positioned as complementary to Torchtitan, Torchtune, and Fairseq2 for different stages of LLM research and development.

Licensing & Compatibility

  • Licensed under BSD-3-Clause.
  • Permissive license suitable for commercial use and integration with closed-source projects.

Limitations & Caveats

The project is noted as being under active development. The provided configuration files are templates that require user adaptation for specific hardware and data paths. Multi-node runs depend on SLURM job management, though local torchrun execution supports debugging and rapid iteration.

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 2
  • Issues (30d): 2
  • Star History: 160 stars in the last 90 days

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Alex Cheema (Cofounder of EXO Labs), and 1 more.

Explore Similar Projects

recurrent-pretraining by seal-rg

Top 0.1% on sourcepulse
806 stars
Pretraining code for depth-recurrent language model research
created 5 months ago
updated 2 weeks ago
Starred by Jeff Hammerbacher (Cofounder of Cloudera) and Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake).

InternEvo by InternLM

Top 1.0% on sourcepulse
402 stars
Lightweight training framework for model pre-training
created 1 year ago
updated 1 week ago
Starred by Logan Kilpatrick (Product Lead on Google AI Studio), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 3 more.

catalyst by catalyst-team

Top 0% on sourcepulse
3k stars
PyTorch framework for accelerated deep learning R&D
created 7 years ago
updated 1 month ago