LongAlign by THUDM

Recipe for long-context LLM alignment (research paper)

Created 1 year ago
256 stars

Top 98.7% on SourcePulse

View on GitHub
Project Summary

LongAlign provides a comprehensive framework for aligning Large Language Models (LLMs) to effectively process and respond to long-context inputs, addressing the challenge of maintaining performance with extended text. It is targeted at researchers and developers working with LLMs who need to improve their capabilities in handling lengthy documents, conversations, or codebases.

How It Works

LongAlign introduces the LongAlign-10k dataset, featuring 10,000 instruction-following examples ranging from 8k to 64k tokens. The core innovation lies in its training strategies: "packing" with loss weighting and "sorted batching." Packing concatenates multiple sequences into a single training sequence up to the maximum context length, using attention masks so that packed examples do not attend to one another, and weights the loss so each example contributes equally regardless of its length or how many examples share a pack. Sorted batching groups sequences of similar length into the same batch, cutting the time GPUs spend idle on padding. Together these methods let LLMs be trained efficiently on extended contexts without significant performance degradation.
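
The sketch below is illustrative only, not the repository's implementation. It assumes tokenized examples of the form {"input_ids": [...], "labels": [...]} with -100 marking non-target tokens, segment ids standing in for a block-diagonal attention mask, and per-token weights of 1 / (target tokens in the source example) to balance each example's contribution:

```python
# Illustrative sketch only, not the repository's implementation.
# Assumes examples of the form {"input_ids": [...], "labels": [...]},
# where labels use -100 for tokens that should not contribute to the loss.
import torch
import torch.nn.functional as F


def pack_examples(examples, max_len=65536):
    """Greedily pack examples into sequences of at most max_len tokens.

    Each pack carries segment ids (so an attention kernel that supports
    block-diagonal masks can stop packed examples from attending to each
    other) and per-token weights of 1 / (target tokens in the source
    example), which balances each example's contribution to the loss.
    """
    packs, ids, segs, weights, seg = [], [], [], [], 0
    for ex in examples:
        n_targets = max(sum(l != -100 for l in ex["labels"]), 1)
        if ids and len(ids) + len(ex["input_ids"]) > max_len:
            packs.append({"input_ids": ids, "segment_ids": segs, "weights": weights})
            ids, segs, weights, seg = [], [], [], 0
        ids = ids + ex["input_ids"]
        segs = segs + [seg] * len(ex["input_ids"])
        weights = weights + [0.0 if l == -100 else 1.0 / n_targets for l in ex["labels"]]
        seg += 1
    if ids:
        packs.append({"input_ids": ids, "segment_ids": segs, "weights": weights})
    return packs


def weighted_loss(logits, labels, weights):
    """Per-token cross entropy scaled by the packing weights, then summed."""
    per_token = F.cross_entropy(
        logits.view(-1, logits.size(-1)), labels.view(-1),
        reduction="none", ignore_index=-100,
    )
    return (per_token * weights.view(-1)).sum()


def sorted_batches(examples, batch_size):
    """Sorted batching: put similar-length examples in the same batch to cut
    padding and idle GPU time, then shuffle batch order to reduce bias."""
    order = sorted(range(len(examples)), key=lambda i: len(examples[i]["input_ids"]))
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    return [batches[i] for i in torch.randperm(len(batches)).tolist()]
```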

Quick Start & Requirements

  • Install: pip install -r requirements.txt
  • Prerequisites: Python 3.x, PyTorch, Hugging Face Transformers. FlashAttention 2 is recommended for Llama-based models to save GPU memory.
  • Hardware: At least 8 x 80GB GPUs are recommended for training with 64k context length to avoid memory overflow.
  • Data: Download LongAlign-10k from Hugging Face datasets (see the loading sketch after this list); ShareGPT data is also used for general instruction examples.
  • Links: 🤗 HF Repo, Paper
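
As a hedged example of the data step above, the snippet below assumes the dataset is published on the Hugging Face Hub under the THUDM/LongAlign-10k identifier and uses the standard datasets API:

```python
# Hedged sketch: assumes the data is hosted as THUDM/LongAlign-10k on the Hub.
from datasets import load_dataset

longalign = load_dataset("THUDM/LongAlign-10k", split="train")
print(longalign)              # number of rows and feature names
print(longalign[0].keys())    # fields of one instruction-following example
```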

Highlighted Details

  • Introduces LongBench-Chat, a benchmark for evaluating LLMs on queries from 10k-100k tokens.
  • Provides pre-trained models with extended context windows: LongAlign-6B-64k, LongAlign-7B-64k, LongAlign-13B-64k, and ChatGLM3-6B-128k (a loading sketch follows this list).
  • Offers specific implementation details and code modifications for packing and sorted batching strategies.
  • Includes evaluation code for "Needle In A Haystack" tests and integration with other benchmarks like LongBench and MT-Bench.
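
For the released checkpoints listed above, a minimal loading sketch with Hugging Face Transformers might look like the following, assuming the checkpoints are hosted under the THUDM organization on the Hub (ChatGLM-based variants ship custom modeling code, hence trust_remote_code=True):

```python
# Hedged sketch: model ids assumed to live under the THUDM Hub organization.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "THUDM/LongAlign-7B-64k"              # Llama-based 64k variant
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,                  # long contexts are memory hungry
    device_map="auto",
    trust_remote_code=True,
)

prompt = "Summarize the following document:\n" + "..."  # placeholder long input
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```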

Maintenance & Community

The project is associated with THUDM (Tsinghua University) and has contributions from multiple researchers. Further community engagement details (e.g., Discord/Slack) are not explicitly mentioned in the README.

Licensing & Compatibility

The README does not state license terms beyond the citation. The project appears to use a permissive license, but confirm the repository's LICENSE file before commercial use or integration with closed-source applications.

Limitations & Caveats

Training requires substantial GPU resources (8x 80GB GPUs recommended), potentially limiting accessibility for users with less powerful hardware. The effectiveness of packing and sorted batching may vary depending on the specific LLM architecture and dataset characteristics.

Health Check

  • Last Commit: 9 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 3 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Pawel Garbacki (cofounder of Fireworks AI), and 4 more.

LongLoRA by dvlab-research

LongLoRA: Efficient fine-tuning for long-context LLMs

Created 2 years ago
Updated 1 year ago
3k stars
Top 0.1% on SourcePulse