lingua by facebookresearch

LLM research codebase for training and inference

Created 11 months ago
4,711 stars

Top 10.5% on SourcePulse

Project Summary

Meta Lingua is a minimal and fast PyTorch library for LLM research, enabling end-to-end training, inference, and evaluation. It targets researchers seeking to experiment with novel architectures, losses, and data, offering a lean, easily modifiable codebase for quick iteration and analysis of speed and stability.

How It Works

Lingua provides a modular PyTorch framework with core components for model architecture, distributed training (Data Parallel, FSDP, model parallelism, torch.compile, activation checkpointing, and float8), data loading, profiling, and checkpoint management. Its distributed.py module is central, abstracting these parallelism strategies behind a single parallelize_module function. Configurations are managed via dataclasses and YAML files, allowing flexible parameter tuning.
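As a concrete illustration of that dataclass-plus-YAML pattern, the minimal sketch below shows how a nested training config might be defined and overridden. All field names here (dp_shard, tp_size, and so on) are hypothetical placeholders rather than Lingua's actual schema, and a plain dict stands in for the YAML file to keep the example self-contained.

    # Minimal sketch of the dataclass-based config pattern described above.
    # Field names are hypothetical, not Lingua's real schema.
    from dataclasses import dataclass, field

    @dataclass
    class DistributedArgs:
        dp_shard: int = 1      # FSDP sharding degree (hypothetical name)
        tp_size: int = 1       # model-parallel degree (hypothetical name)
        compile: bool = True   # whether to wrap the model in torch.compile
        float8: bool = False   # whether to enable float8 training

    @dataclass
    class TrainArgs:
        steps: int = 1000
        distributed: DistributedArgs = field(default_factory=DistributedArgs)

    # In Lingua these values come from a YAML file; a dict stands in here.
    overrides = {"steps": 5000, "distributed": {"dp_shard": 8, "float8": True}}
    args = TrainArgs(
        steps=overrides["steps"],
        distributed=DistributedArgs(**overrides["distributed"]),
    )
    print(args)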

Quick Start & Requirements

  • Install via git clone https://github.com/facebookresearch/lingua, then run bash setup/create_env.sh (or sbatch setup/create_env.sh on a SLURM cluster).
  • Requires Python and PyTorch; SLURM is used for multi-node distributed training, while torchrun handles local runs.
  • Prepare training data with python setup/download_prepare_hf_data.py and download a tokenizer with python setup/download_tokenizer.py.
  • Official docs: https://github.com/facebookresearch/lingua

Highlighted Details

  • Achieves strong downstream-task performance, matching the DCLM baseline 1.0 at the 1B-parameter scale.
  • Supports various model architectures including Transformer, minGRU, minLSTM, Hawk, and Mamba.
  • Integrates advanced distributed training features like FSDP and torch.compile for efficiency.
  • Offers detailed profiling tools for MFU, HFU, and memory usage; see the sketch after this list.
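For intuition on the MFU number such a profiler reports, the sketch below uses the common approximation of ~6 FLOPs per parameter per token for a combined forward and backward pass. This is a back-of-the-envelope illustration, not Lingua's profiler code, and the H100 peak-throughput figure is an assumed value.

    # Rough MFU estimate: achieved FLOP/s divided by hardware peak FLOP/s.
    # Illustrative only; not Lingua's actual profiler implementation.
    def mfu(tokens_per_sec: float, n_params: int, peak_flops: float) -> float:
        achieved = tokens_per_sec * 6.0 * n_params  # ~6 FLOPs/param/token (fwd+bwd)
        return achieved / peak_flops

    # Example: a 1B-parameter model at 60k tokens/s on one H100
    # (~989 TFLOP/s peak in bf16 -- an assumed figure).
    print(f"MFU = {mfu(60_000, 1_000_000_000, 989e12):.1%}")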

Maintenance & Community

  • Developed by Meta AI.
  • Positioned as complementary to Torchtitan, Torchtune, and Fairseq2 for different stages of LLM research and development.

Licensing & Compatibility

  • Licensed under BSD-3-Clause.
  • Permissive license suitable for commercial use and integration with closed-source projects.

Limitations & Caveats

The project is under active development. Configuration files are templates that must be adapted to specific hardware and data paths. For debugging and rapid iteration, jobs can be managed through SLURM or run locally via torchrun.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 33 stars in the last 30 days

Explore Similar Projects

Starred by Yaowei Zheng (Author of LLaMA-Factory), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 1 more.

VeOmni by ByteDance-Seed

Top 3.4% on SourcePulse · 1k stars
Framework for scaling multimodal model training across accelerators
Created 5 months ago · Updated 3 weeks ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla and OpenAI; author of CS 231n), Lewis Tunstall (Research Engineer at Hugging Face), and 13 more.

torchtitan by pytorch

Top 0.7% on SourcePulse · 4k stars
PyTorch platform for generative AI model training research
Created 1 year ago · Updated 19 hours ago
Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), Lewis Tunstall (Research Engineer at Hugging Face), and 15 more.

torchtune by pytorch

Top 0.2% on SourcePulse · 5k stars
PyTorch library for LLM post-training and experimentation
Created 1 year ago · Updated 1 day ago
Starred by Jeff Hammerbacher (Cofounder of Cloudera), Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), and 25 more.

gpt-neox by EleutherAI

Top 0.2% on SourcePulse · 7k stars
Framework for training large-scale autoregressive language models
Created 4 years ago · Updated 2 days ago