ml-cross-entropy by apple

PyTorch module for memory-efficient cross-entropy in LLMs

created 8 months ago
506 stars

Top 62.4% on sourcepulse

Project Summary

This library provides Cut Cross-Entropy (CCE), a memory-efficient method for computing the cross-entropy loss in large-vocabulary language models. It targets researchers and engineers working with LLMs, offering significant memory reductions during training without compromising speed or convergence.

How It Works

CCE avoids materializing the full logit matrix by computing only the logit for the correct token and performing the log-sum-exp reduction on the fly. This is done with custom Triton kernels that carry out the matrix multiplications and reductions in on-chip SRAM, in the style of FlashAttention, drastically reducing global memory consumption. The backward pass is further optimized by skipping terms whose contribution to the gradient is negligible, improving throughput.
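To make the memory claim concrete, here is a minimal naive reference in plain PyTorch for the quantity CCE computes. This is the baseline CCE avoids, not the library's API, and the shapes are illustrative:

```python
import torch
import torch.nn.functional as F

# Naive reference for the loss CCE computes. This path materializes the
# full [N, V] logit matrix -- exactly the allocation CCE avoids by
# computing tiles of it in on-chip SRAM and reducing on the fly.
N, D, V = 1024, 2048, 32_000     # tokens, hidden size, vocab (illustrative)
embeddings = torch.randn(N, D)   # final hidden states
classifier = torch.randn(V, D)   # lm_head / classifier weight matrix
targets = torch.randint(0, V, (N,))

logits = embeddings @ classifier.T   # [N, V]: the tensor CCE never builds
loss = F.cross_entropy(logits, targets)
# Per token i with label y_i: loss_i = logsumexp_v(e_i . c_v) - e_i . c_{y_i}
```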

Quick Start & Requirements

  • Install: pip install "cut-cross-entropy @ git+https://github.com/apple/ml-cross-entropy.git"
  • Requirements: Python 3.10+, PyTorch 2.4+, Triton 3.0+, Ampere (or newer) GPU. A torch.compile fallback is available for unsupported systems (e.g., macOS).
  • Usage: from cut_cross_entropy import linear_cross_entropy (see the sketch after this list)
  • Docs: https://github.com/apple/ml-cross-entropy
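A minimal usage sketch following the README's example; the impl keyword for selecting an implementation comes from the repository docs and is worth verifying against the version you install:

```python
import torch
from cut_cross_entropy import linear_cross_entropy

# embeddings: final hidden states; classifier: the lm_head weight matrix.
# The default CCE kernel requires a CUDA GPU (Ampere or newer).
embeddings = torch.randn(8192, 2048, dtype=torch.bfloat16, device="cuda")
classifier = torch.randn(256_000, 2048, dtype=torch.bfloat16, device="cuda")
targets = torch.randint(0, 256_000, (8192,), device="cuda")

loss = linear_cross_entropy(embeddings, classifier, targets)
# On unsupported systems, the README documents a torch.compile fallback,
# selected via the impl argument (e.g. impl="torch_compile").
```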

Highlighted Details

  • Reduces loss computation memory from 24 GB to 1 MB for Gemma 2 (2B).
  • Supports vocabulary parallelism for sharded classifier weights.
  • Integrates with Hugging Face Transformers via cce_patch for the Llama, Phi3, Mistral, and Gemma2 families (see the sketch after this list).
  • Offers multiple implementations (cce, torch_compile, cce_kahan, cce_kahan_full_c, cce_exact) for different precision and performance needs.
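A hedged sketch of the Transformers integration: the cce_patch import path follows the repository's README, but the exact call pattern should be checked against the installed version, and the checkpoint name below is only a placeholder:

```python
from transformers import AutoModelForCausalLM
from cut_cross_entropy.transformers import cce_patch

# Load any supported family (Llama, Phi3, Mistral, Gemma2); the
# checkpoint name is a placeholder, not a recommendation.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")
model = cce_patch(model)  # loss in forward(labels=...) now uses CCE
```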

Maintenance & Community

The project is maintained by Apple. The README does not document community channels, contribution guidelines, or other engagement details.

Licensing & Compatibility

The repository ships a LICENSE file, but the README does not summarize its terms. Compatibility with commercial use or closed-source linking is therefore not detailed.

Limitations & Caveats

The primary CCE implementation requires an Ampere or newer GPU. While a torch.compile fallback exists, its performance characteristics may differ. The exact license terms are not specified, which could impact commercial adoption.

Health Check

  • Last commit: 5 days ago
  • Responsiveness: 1 week
  • Pull requests (30d): 4
  • Issues (30d): 2
  • Star history: 75 stars gained in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Jaret Burkett (Founder of Ostris), and 1 more.

nunchaku by nunchaku-tech

2.1% · 3k stars
High-performance 4-bit diffusion model inference engine
created 8 months ago · updated 14 hours ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; Author of CS 231n), Jiayi Pan (Author of SWE-Gym; AI Researcher at UC Berkeley), and 5 more.

Liger-Kernel by linkedin

0.6% · 5k stars
Triton kernels for efficient LLM training
created 1 year ago · updated 1 day ago