flame by fla-org

Minimal, efficient framework for LLM training

Created 9 months ago
263 stars

Top 97.0% on SourcePulse

1 Expert Loves This Project
Project Summary

Summary

Flame is a minimal, efficient training framework built on torchtitan for scaling large language models (LLMs). It targets engineers and researchers seeking high performance and ease of use, offering features like zero-cost data preprocessing and advanced parallelism for faster LLM development.

How It Works

Flame leverages torchtitan to provide a streamlined training experience. Its core design emphasizes efficiency through zero-cost data preprocessing, including online tokenization and dataset shuffling, and supports multiple datasets. The framework is built for scalability, with features like 4D parallelism planned for future releases, aiming to accelerate LLM training pipelines.
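To make the zero-cost preprocessing idea concrete, here is a minimal sketch of online tokenization and shuffling using the Hugging Face datasets library in streaming mode. It illustrates the concept only and is not Flame's internal pipeline; the gpt2 tokenizer is a stand-in.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Stream the corpus so no offline preprocessing pass is needed.
dataset = load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True)

# Shuffle with a bounded buffer; order is randomized on the fly.
dataset = dataset.shuffle(seed=42, buffer_size=10_000)

# Tokenize lazily as the training loop iterates (gpt2 is only a stand-in tokenizer).
tokenizer = AutoTokenizer.from_pretrained("gpt2")
dataset = dataset.map(lambda batch: tokenizer(batch["text"]), batched=True)

# Each yielded example now carries input_ids produced online.
first = next(iter(dataset))
print(len(first["input_ids"]))
```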

Quick Start & Requirements

Installation involves cloning the repository and running pip install . from the repo root. Key dependencies include specific versions of flash-linear-attention and torchtitan (commit 0b44d4c). Dataset preparation uses the datasets library to load corpora such as HuggingFaceFW/fineweb-edu. Training is launched via bash train.sh and configured through numerous command-line arguments. For torch.compile usage, torch>=2.6 and triton>=3.0 are recommended. Multi-node training is supported; environment variables such as MASTER_ADDR and MASTER_PORT must be set manually or are provided by job schedulers.
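As a hedged illustration of the multi-node variables mentioned above, the snippet below shows how MASTER_ADDR and MASTER_PORT feed PyTorch's default env:// rendezvous. The hostname is hypothetical, and in practice torchrun or the job scheduler sets these (plus RANK and WORLD_SIZE) rather than the training code itself.

```python
import os
import torch.distributed as dist

# Placeholders only: torchrun or schedulers such as Slurm normally export these.
os.environ.setdefault("MASTER_ADDR", "node-0.example.com")  # hypothetical head node
os.environ.setdefault("MASTER_PORT", "29500")

# The default env:// init also reads RANK and WORLD_SIZE from the environment.
dist.init_process_group(backend="nccl")
print(f"rank {dist.get_rank()} of {dist.get_world_size()} initialized")
```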

Highlighted Details

  • Zero-Cost Data Preprocessing: Enables online tokenization, dataset shuffling, and support for multiple datasets without upfront processing costs.
  • Variable-Length Training: Utilizes --training.varlen to pack variable-length documents into fixed sequences, eliminating padding and improving efficiency (a conceptual sketch follows this list).
  • torch.compile Integration: Supports PyTorch 2.0+ compilation via --training.compile for potential speedups, though conflicts with fused kernels may arise.
  • Advanced Parallelism: Features include support for tensor parallelism, pipeline parallelism (requiring manual split point specification), and planned 4D parallelism.
  • Checkpointing & Conversion: Manages distributed checkpoints (DCPs) and provides scripts to convert between DCP and Hugging Face formats for seamless training resumption and model sharing.
  • Float8 Support: Integrates Float8 precision via torchao for potential memory and speed benefits.
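To give a sense of what the --training.varlen flag accomplishes, here is a conceptual sketch of greedy document packing. It is not Flame's implementation; the real pipeline additionally records document boundaries (e.g. cumulative sequence lengths) so attention never crosses documents.

```python
import torch

def pack_documents(token_lists, seq_len):
    """Greedily concatenate variable-length token lists into fixed-length
    sequences so no position is spent on padding (conceptual sketch only)."""
    buffer, packed = [], []
    for tokens in token_lists:
        buffer.extend(tokens)
        while len(buffer) >= seq_len:
            packed.append(torch.tensor(buffer[:seq_len]))
            buffer = buffer[seq_len:]
    return packed  # leftover tokens in `buffer` would carry over to the next batch

docs = [[1, 2, 3], [4, 5, 6, 7, 8], [9] * 12]
print([seq.tolist() for seq in pack_documents(docs, seq_len=8)])
```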

Maintenance & Community

The provided README does not detail specific community channels (e.g., Discord, Slack), active maintainers beyond the authors listed in the citation, or sponsorship information.

Licensing & Compatibility

The repository's license is not specified in the provided README content. This lack of information presents an adoption blocker, particularly for commercial use or integration into closed-source projects.

Limitations & Caveats

torch.compile may conflict with Flame's fused kernels, so up-to-date dependencies are required. Dataset streaming can be unstable due to network dependencies; local downloads are recommended for reliable training. 4D parallelism is listed as "coming soon," and pipeline parallelism requires manual definition of split points. The absence of explicit licensing information is a significant caveat.
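Where streaming is unreliable, one workaround is to snapshot the dataset locally before training. The call below uses huggingface_hub and is a generic example rather than a Flame utility; the allow_patterns filter is illustrative, since the full corpus is very large.

```python
from huggingface_hub import snapshot_download

# Fetch (part of) the dataset into the local Hugging Face cache once,
# then train from disk instead of streaming over the network.
local_path = snapshot_download(
    "HuggingFaceFW/fineweb-edu",
    repo_type="dataset",
    allow_patterns=["sample/10BT/*"],  # illustrative subset filter; adjust as needed
)
print(f"dataset cached at {local_path}")
```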

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 3
  • Star History: 25 stars in the last 30 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Pawel Garbacki (Cofounder of Fireworks AI), and 14 more.

torchtitan by pytorch

0.6%
5k
PyTorch platform for generative AI model training research
Created 1 year ago
Updated 23 hours ago
Starred by Jeff Hammerbacher (Cofounder of Cloudera), Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), and 25 more.

gpt-neox by EleutherAI

0.1%
7k
Framework for training large-scale autoregressive language models
Created 4 years ago
Updated 2 weeks ago
Starred by Tobi Lutke (Cofounder of Shopify), Li Jiang (Coauthor of AutoGen; Engineer at Microsoft), and 27 more.

ColossalAI by hpcaitech

0.0%
41k
AI system for large-scale parallel training
Created 4 years ago
Updated 1 day ago