distilling-step-by-step by google-research

Code for research paper on knowledge distillation

Created 2 years ago
557 stars

Top 57.5% on SourcePulse

Project Summary

This repository provides code for the paper "Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes." It enables users to train smaller language models to achieve performance comparable to or exceeding larger models, using less data and computational resources. The target audience includes researchers and practitioners in NLP and machine learning looking to optimize model efficiency and performance.

How It Works

The core approach involves a distillation technique that trains smaller models to mimic the reasoning process of larger models. This is achieved by generating intermediate "rationales" or step-by-step explanations from a larger LLM (like PaLM) and then using these rationales, along with the ground truth labels, to fine-tune a smaller T5 model. The alpha parameter controls the weighting between the rationale generation loss and the label prediction loss in multi-task training.
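
In rough terms, the multi-task objective is a weighted sum of the two losses. The sketch below is a simplified illustration under the assumption that both tasks use token-level cross-entropy and that alpha weights the label-prediction term; it is not the repository's exact trainer code.

```python
import torch
import torch.nn.functional as F


def multitask_loss(label_logits: torch.Tensor, label_targets: torch.Tensor,
                   rationale_logits: torch.Tensor, rationale_targets: torch.Tensor,
                   alpha: float = 0.5) -> torch.Tensor:
    """Weighted sum of label-prediction and rationale-generation losses (sketch)."""
    label_loss = F.cross_entropy(
        label_logits.reshape(-1, label_logits.size(-1)),
        label_targets.reshape(-1),
        ignore_index=-100,  # standard padding-mask convention for seq2seq targets
    )
    rationale_loss = F.cross_entropy(
        rationale_logits.reshape(-1, rationale_logits.size(-1)),
        rationale_targets.reshape(-1),
        ignore_index=-100,
    )
    # alpha balances the two tasks; alpha=1.0 reduces to standard label fine-tuning.
    return alpha * label_loss + (1.0 - alpha) * rationale_loss
```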

Quick Start & Requirements

  • Install: Create a Conda environment with pinned PyTorch (1.12.1), torchvision (0.13.1), torchaudio (0.12.1), and cudatoolkit=11.3, then install Python dependencies via pip, including transformers v4.24.0.
  • Prerequisites: Python 3.10.6, Conda, PyTorch with CUDA 11.3, the datasets library, sentencepiece, protobuf==3.20.*, and tensorboardX. Unzip datasets.zip into the datasets/ directory.
  • Resources: Requires a GPU with CUDA 11.3. Setup consists of environment creation and dependency installation; a quick environment sanity check is sketched after this list.
  • Docs: Hugging Face Transformers
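
As a quick sanity check of the pinned environment, the snippet below prints the installed PyTorch and Transformers versions and loads the smallest supported T5 checkpoint. Loading the model assumes network access or a local Hugging Face cache; the example input is illustrative only.

```python
import torch
import transformers
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Confirm the pinned stack (PyTorch 1.12.1 + CUDA 11.3, transformers 4.24.0).
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)

# Load the smallest T5 size listed in this summary to verify tokenizer and model loading.
tokenizer = AutoTokenizer.from_pretrained("google/t5-v1_1-small")
model = T5ForConditionalGeneration.from_pretrained("google/t5-v1_1-small")

inputs = tokenizer("explain: Where would you find a seashell?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```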

Highlighted Details

  • Supports fine-tuning with either ground truth labels (label_type gt) or LLM-predicted labels (label_type llm).
  • Enables multi-task training with an alpha parameter to balance label prediction and rationale generation losses.
  • Offers a task_prefix model type for the "distilling step-by-step" approach; an input-formatting sketch follows this list.
  • Compatible with various T5 model sizes (google/t5-v1_1-small to xxl) and datasets (esnli, anli1, cqa, svamp).
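
To illustrate the task_prefix idea, the sketch below expands one annotated example into the two training pairs used in multi-task training: label prediction and rationale generation. The "predict:"/"explain:" prefixes and field names are illustrative assumptions, not necessarily the strings used in the repository.

```python
def expand_example(question: str, answer: str, rationale: str) -> list[dict]:
    """Turn one annotated example into two task-prefixed training pairs.

    Illustrative sketch: the prefixes and field names are assumptions; the
    repository's preprocessing may use different conventions.
    """
    return [
        # Task 1: predict the (ground-truth or LLM-provided) label.
        {"input": f"predict: {question}", "target": answer},
        # Task 2: generate the rationale extracted from the larger LLM.
        {"input": f"explain: {question}", "target": rationale},
    ]


pairs = expand_example(
    question="Where would you find a seashell?",
    answer="beach",
    rationale="Seashells wash up on the shore, so a beach is where one is found.",
)
for pair in pairs:
    print(pair["input"], "->", pair["target"])
```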

Maintenance & Community

The project is associated with Google Research. No specific community links (Discord/Slack) or roadmap are provided in the README.

Licensing & Compatibility

The repository itself is not explicitly licensed in the README. The code is likely subject to the Apache 2.0 license of the underlying Google Research projects, but this should be verified. Compatibility for commercial use depends on the licenses of the models and datasets used.

Limitations & Caveats

The setup requires specific, older versions of PyTorch and CUDA, which might pose compatibility challenges with newer hardware or software stacks. The project is presented as code for a specific paper, and its ongoing maintenance status is unclear.

Health Check
  • Last Commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 1 star in the last 30 days

Explore Similar Projects

Starred by Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), Edward Sun (Research Scientist at Meta Superintelligence Lab), and 1 more.

awesome-knowledge-distillation by dkozlov

Top 0.1% · 4k stars
Collection of knowledge distillation resources
Created 8 years ago · Updated 3 months ago