lingua by facebookresearch

LLM research codebase for training and inference

created 9 months ago
4,668 stars

Top 10.8% on sourcepulse

Project Summary

Meta Lingua is a minimal and fast PyTorch library for LLM research, enabling end-to-end training, inference, and evaluation. It targets researchers seeking to experiment with novel architectures, losses, and data, offering a lean, easily modifiable codebase for quick iteration and analysis of speed and stability.

How It Works

Lingua provides a modular PyTorch framework with core components for model architecture, distributed training (Data Parallel, FSDP, model parallelism, torch.compile, activation checkpointing, and float8), data loading, profiling, and checkpoint management. Its distributed.py module is central, abstracting the complex parallelism strategies behind a single parallelize_module function. Configurations are managed via dataclasses and YAML files, allowing flexible parameter tuning.
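
As a rough illustration of this pattern, the sketch below shows how a dataclass-backed config can be overlaid with values from a YAML document. The field names (dp_shard, tp_size, and so on) and the load_config helper are hypothetical, chosen for illustration; lingua's actual dataclasses and the real parallelize_module signature live in the repository.

    # A minimal sketch of the dataclass + YAML configuration pattern.
    # Field names are hypothetical; see lingua's source for the real ones.
    from dataclasses import dataclass, field

    import yaml  # PyYAML

    @dataclass
    class DistributedArgs:
        dp_shard: int = 1      # FSDP sharding degree (assumed name)
        tp_size: int = 1       # model-parallel degree (assumed name)
        compile: bool = False  # wrap the model in torch.compile
        float8: bool = False   # enable float8 matmuls where supported

    @dataclass
    class TrainArgs:
        steps: int = 1000
        distributed: DistributedArgs = field(default_factory=DistributedArgs)

    def load_config(text: str) -> TrainArgs:
        """Overlay a YAML document onto the dataclass defaults."""
        raw = yaml.safe_load(text) or {}
        dist = DistributedArgs(**raw.pop("distributed", {}))
        return TrainArgs(distributed=dist, **raw)

    cfg = load_config("""
    steps: 2000
    distributed:
      dp_shard: 8
      compile: true
    """)
    print(cfg.distributed.dp_shard)  # -> 8

In Lingua itself, arguments like these are what drive distributed.py's parallelize_module, which, per the summary above, applies the chosen parallelism strategy in one place.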

Quick Start & Requirements

  • Install by cloning the repo (git clone https://github.com/facebookresearch/lingua) and creating the environment with bash setup/create_env.sh, or sbatch setup/create_env.sh on a SLURM cluster.
  • Requires Python and PyTorch; SLURM is used for cluster-scale distributed training.
  • Prepare training data with python setup/download_prepare_hf_data.py and download a tokenizer with python setup/download_tokenizer.py.
  • Official docs: https://github.com/facebookresearch/lingua

Highlighted Details

  • Achieves strong performance on downstream tasks, matching the DCLM baseline 1.0 at the 1B-parameter scale.
  • Supports various model architectures including Transformer, minGRU, minLSTM, Hawk, and Mamba.
  • Integrates advanced distributed training features like FSDP and torch.compile for efficiency.
  • Offers detailed profiling tools for MFU, HFU, and memory usage.
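
As a back-of-envelope illustration of what the MFU and HFU numbers measure, the sketch below uses the common approximation of ~6*N training FLOPs per token for an N-parameter model (~8*N when activations are recomputed under checkpointing). The throughput and peak-FLOPs figures are illustrative, not Lingua output.

    # Rough MFU/HFU arithmetic using the standard 6*N FLOPs-per-token
    # training estimate (8*N with full activation recomputation).
    def mfu(n_params: float, tokens_per_sec: float, peak_flops: float) -> float:
        """Model FLOPs utilization: achieved training FLOP/s over peak FLOP/s."""
        return 6 * n_params * tokens_per_sec / peak_flops

    def hfu(n_params: float, tokens_per_sec: float, peak_flops: float) -> float:
        """Hardware FLOPs utilization: also counts the recompute pass."""
        return 8 * n_params * tokens_per_sec / peak_flops

    # Example: a 1B-parameter model at 50k tokens/s on a ~989 TFLOP/s
    # (BF16, H100-class) accelerator -- figures chosen for illustration.
    print(f"MFU: {mfu(1e9, 5e4, 989e12):.1%}")  # ~30.3%
    print(f"HFU: {hfu(1e9, 5e4, 989e12):.1%}")  # ~40.4%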

Maintenance & Community

  • Developed by Meta AI.
  • Positioned as complementary to Torchtitan, Torchtune, and Fairseq2 for different stages of LLM research and development.

Licensing & Compatibility

  • Licensed under BSD-3-Clause.
  • Permissive license suitable for commercial use and integration with closed-source projects.

Limitations & Caveats

The project is noted as being under active development. The provided configuration files are templates that require user adaptation for specific hardware and data paths. Multi-node runs depend on SLURM job management, though local torchrun execution supports debugging and rapid iteration.

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 2
  • Issues (30d): 2
  • Star History: 160 stars in the last 90 days

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Alex Cheema (Cofounder of EXO Labs), and 1 more.

Explore Similar Projects

recurrent-pretraining by seal-rg

Top 0.1% on sourcepulse
806 stars
Pretraining code for depth-recurrent language model research
created 5 months ago
updated 2 weeks ago
Starred by Jeff Hammerbacher (Cofounder of Cloudera) and Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake).

InternEvo by InternLM

Top 1.0% on sourcepulse
402 stars
Lightweight training framework for model pre-training
created 1 year ago
updated 1 week ago
Starred by Logan Kilpatrick (Product Lead on Google AI Studio), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 3 more.

catalyst by catalyst-team

Top 0% on sourcepulse
3k stars
PyTorch framework for accelerated deep learning R&D
created 7 years ago
updated 1 month ago