LongAlign by THUDM

Recipe for long-context LLM alignment (research paper)

Created 1 year ago
256 stars

Top 98.7% on SourcePulse

View on GitHub
Project Summary

LongAlign provides a comprehensive framework for aligning Large Language Models (LLMs) to effectively process and respond to long-context inputs, addressing the challenge of maintaining performance with extended text. It is targeted at researchers and developers working with LLMs who need to improve their capabilities in handling lengthy documents, conversations, or codebases.

How It Works

LongAlign introduces the LongAlign-10k dataset, featuring 10,000 instruction-following examples ranging from 8k to 64k tokens. The core innovation lies in its training strategies: "packing" with loss weighting and "sorted batching." Packing concatenates multiple sequences into a single training sequence up to the maximum context length, using attention masks so that packed examples do not attend to one another, and weights the loss so each example contributes equally regardless of its length or how many examples share a pack. Sorted batching groups sequences of similar length into the same batch, cutting the time GPUs spend idle on padding. Together these methods let LLMs be trained efficiently on extended contexts without significant performance degradation.
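
The sketch below is illustrative only, not the repository's implementation. It assumes tokenized examples of the form {"input_ids": [...], "labels": [...]} with -100 marking non-target tokens, segment ids standing in for a block-diagonal attention mask, and per-token weights of 1 / (target tokens in the source example) to balance each example's contribution:

```python
# Illustrative sketch only, not the repository's implementation.
# Assumes examples of the form {"input_ids": [...], "labels": [...]},
# where labels use -100 for tokens that should not contribute to the loss.
import torch
import torch.nn.functional as F


def pack_examples(examples, max_len=65536):
    """Greedily pack examples into sequences of at most max_len tokens.

    Each pack carries segment ids (so an attention kernel that supports
    block-diagonal masks can stop packed examples from attending to each
    other) and per-token weights of 1 / (target tokens in the source
    example), which balances each example's contribution to the loss.
    """
    packs, ids, segs, weights, seg = [], [], [], [], 0
    for ex in examples:
        n_targets = max(sum(l != -100 for l in ex["labels"]), 1)
        if ids and len(ids) + len(ex["input_ids"]) > max_len:
            packs.append({"input_ids": ids, "segment_ids": segs, "weights": weights})
            ids, segs, weights, seg = [], [], [], 0
        ids = ids + ex["input_ids"]
        segs = segs + [seg] * len(ex["input_ids"])
        weights = weights + [0.0 if l == -100 else 1.0 / n_targets for l in ex["labels"]]
        seg += 1
    if ids:
        packs.append({"input_ids": ids, "segment_ids": segs, "weights": weights})
    return packs


def weighted_loss(logits, labels, weights):
    """Per-token cross entropy scaled by the packing weights, then summed."""
    per_token = F.cross_entropy(
        logits.view(-1, logits.size(-1)), labels.view(-1),
        reduction="none", ignore_index=-100,
    )
    return (per_token * weights.view(-1)).sum()


def sorted_batches(examples, batch_size):
    """Sorted batching: put similar-length examples in the same batch to cut
    padding and idle GPU time, then shuffle batch order to reduce bias."""
    order = sorted(range(len(examples)), key=lambda i: len(examples[i]["input_ids"]))
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    return [batches[i] for i in torch.randperm(len(batches)).tolist()]
```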

Quick Start & Requirements

  • Install: pip install -r requirements.txt
  • Prerequisites: Python 3.x, PyTorch, Hugging Face Transformers. FlashAttention 2 is recommended for Llama-based models to save GPU memory.
  • Hardware: At least 8 x 80GB GPUs are recommended for training with 64k context length to avoid memory overflow.
  • Data: Download LongAlign-10k from Hugging Face datasets (see the loading sketch after this list); ShareGPT data is also used for general instruction examples.
  • Links: 🤗 HF Repo, Paper
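
As a hedged example of the data step above, the snippet below assumes the dataset is published on the Hugging Face Hub under the THUDM/LongAlign-10k identifier and uses the standard datasets API:

```python
# Hedged sketch: assumes the data is hosted as THUDM/LongAlign-10k on the Hub.
from datasets import load_dataset

longalign = load_dataset("THUDM/LongAlign-10k", split="train")
print(longalign)              # number of rows and feature names
print(longalign[0].keys())    # fields of one instruction-following example
```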

Highlighted Details

  • Introduces LongBench-Chat, a benchmark for evaluating LLMs on queries from 10k-100k tokens.
  • Provides pre-trained models with extended context windows: LongAlign-6B-64k, LongAlign-7B-64k, LongAlign-13B-64k, and ChatGLM3-6B-128k (a loading sketch follows this list).
  • Offers specific implementation details and code modifications for packing and sorted batching strategies.
  • Includes evaluation code for "Needle In A Haystack" tests and integration with other benchmarks like LongBench and MT-Bench.
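
For the released checkpoints listed above, a minimal loading sketch with Hugging Face Transformers might look like the following, assuming the checkpoints are hosted under the THUDM organization on the Hub (ChatGLM-based variants ship custom modeling code, hence trust_remote_code=True):

```python
# Hedged sketch: model ids assumed to live under the THUDM Hub organization.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "THUDM/LongAlign-7B-64k"              # Llama-based 64k variant
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,                  # long contexts are memory hungry
    device_map="auto",
    trust_remote_code=True,
)

prompt = "Summarize the following document:\n" + "..."  # placeholder long input
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```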

Maintenance & Community

The project is associated with THUDM (Tsinghua University) and has contributions from multiple researchers. Further community engagement details (e.g., Discord/Slack) are not explicitly mentioned in the README.

Licensing & Compatibility

The README does not state license terms beyond the citation. The project appears to use a permissive license, but confirm the repository's LICENSE file before commercial use or integration with closed-source applications.

Limitations & Caveats

Training requires substantial GPU resources (8x 80GB GPUs recommended), potentially limiting accessibility for users with less powerful hardware. The effectiveness of packing and sorted batching may vary depending on the specific LLM architecture and dataset characteristics.

Health Check

  • Last Commit: 9 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 3 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Pawel Garbacki (cofounder of Fireworks AI), and 4 more.

LongLoRA by dvlab-research

LongLoRA: Efficient fine-tuning for long-context LLMs

Created 2 years ago
Updated 1 year ago
3k stars
Top 0.1% on SourcePulse