LongLM by datamllab

Self-Extend: LLM context window extension via self-attention

Created 1 year ago · 660 stars · Top 51.7% on sourcepulse

Project Summary

This repository provides an implementation of Self-Extend, a method to significantly extend the context window of Large Language Models (LLMs) without requiring any fine-tuning. It targets researchers and practitioners working with LLMs who need to process longer sequences, offering a way to leverage the inherent long-context capabilities of existing models.

How It Works

Self-Extend constructs bi-level attention information: group-level attention, which maps distant tokens onto coarse position groups via floor division of their positions, and neighbor-level attention, which keeps exact relative positions for nearby tokens. Both are computed with the model's existing self-attention mechanism, so no training is required. By structuring attention this way across the input sequence, Self-Extend elicits the LLM's inherent ability to handle contexts longer than its pretraining window.
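
The core of the method is how relative positions are remapped before they reach the positional encoding. The sketch below illustrates that mapping under the grouping described above; the function and variable names are illustrative, not the repository's API.

```python
# Illustrative sketch of Self-Extend's bi-level relative-position mapping.
# Parameter names mirror SelfExtend.apply (group size, neighbor window),
# but this is not the repository's implementation, just the idea in miniature.

def self_extend_relative_position(q_pos: int, k_pos: int,
                                  group_size: int, neighbor_window: int) -> int:
    """Map a (query, key) position pair to the relative position fed to
    the positional encoding (e.g. RoPE) under Self-Extend."""
    distance = q_pos - k_pos  # causal attention, so q_pos >= k_pos
    if distance < neighbor_window:
        # Neighbor-level: nearby tokens keep their exact relative
        # positions, preserving local precision.
        return distance
    # Group-level: distant tokens are bucketed by floor division so the
    # largest relative position stays inside the pretrained range; the
    # shift keeps the two levels contiguous at the window boundary.
    shift = neighbor_window - neighbor_window // group_size
    return q_pos // group_size - k_pos // group_size + shift


# A token 8191 positions away is seen at a much smaller relative position:
print(self_extend_relative_position(8191, 0, group_size=8, neighbor_window=1024))  # -> 1919
```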

Quick Start & Requirements

  • Install: Clone the repository.
  • Dependencies: transformers==4.38.2, flash_attn==2.5.6. A Docker image (hoytjin/selfextend_docker:v0.1) is recommended to avoid environment issues.
  • Usage: Apply the method via SelfExtend.apply(loaded_model, group_size, window_size, enable_flash_attention=False); a usage sketch follows this list.
  • Example: Run python example.py.
  • Documentation: example.py
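
A minimal end-to-end sketch, assuming the cloned repository is on the Python path. The SelfExtend.apply call follows the signature quoted above; the model checkpoint, hyperparameter values, and prompt are illustrative choices.

```python
# Usage sketch: load a supported model with transformers, then patch it
# in place with SelfExtend.apply(). Assumes the cloned LongLM directory
# is on PYTHONPATH; model name and hyperparameter values are examples.
import SelfExtend  # module provided by the cloned repository
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # any supported model family
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")

# Positional arguments: group_size=8, window_size (neighbor window)=1024.
SelfExtend.apply(model, 8, 1024, enable_flash_attention=False)

# The patched model can now attend over inputs longer than its pretrained
# context window, without any fine-tuning.
inputs = tokenizer("A very long document ...", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```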

Highlighted Details

  • Supports Llama, Mistral, Phi-2, Qwen1.5, and Gemma models.
  • Offers a Triton-based FlashSelfExtend implementation for potential performance gains.
  • Showcased in a Google I/O session demonstrating Gemma's long-context abilities.
  • Provides guidance and empirical rules for selecting the group_size and neighbor_window hyperparameters (a rough sizing sketch follows this list).
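
As a rough way to reason about those two knobs: every remapped relative position must stay within the pretrained context window, which bounds the usable extended length at approximately (pretrained_length - neighbor_window) * group_size + neighbor_window. This follows from the position-mapping sketch above and is an approximation, not a rule quoted from the repository.

```python
# Back-of-the-envelope bound on the extended context length, derived from
# the position-mapping sketch above (an approximation, not the project's
# official guidance).
def approx_max_extended_length(pretrained_len: int, group_size: int,
                               neighbor_window: int) -> int:
    return (pretrained_len - neighbor_window) * group_size + neighbor_window

# e.g. a 4k-context model with group_size=8 and neighbor_window=1024
# could in principle cover roughly 25k tokens:
print(approx_max_extended_length(4096, group_size=8, neighbor_window=1024))  # -> 25600
```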

Maintenance & Community

  • The accompanying paper was accepted at ICML 2024.
  • Active development with recent updates for Llama-3 support.
  • A Discord server is available for discussions.

Licensing & Compatibility

  • License: MIT License.
  • Compatibility: Permissive license suitable for commercial use and integration with closed-source applications.

Limitations & Caveats

The effectiveness and optimal hyperparameter selection (group_size, neighbor_window) can depend on the specific model and task, with empirical rules provided as guidance. While FlashAttention is supported, its full functionality for the decoding stage is still under active debugging.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 15 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Jeff Hammerbacher (Cofounder of Cloudera), and 1 more.

yarn by jquesnelle

  • Top 1.0% on sourcepulse · 2k stars
  • Context window extension method for LLMs (research paper, models)
  • Created 2 years ago, updated 1 year ago
  • Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems) and Georgios Konstantopoulos (CTO, General Partner at Paradigm).

LongLoRA by dvlab-research

  • Top 0.1% on sourcepulse · 3k stars
  • LongLoRA: Efficient fine-tuning for long-context LLMs
  • Created 1 year ago, updated 11 months ago