LongLM by datamllab

Self-Extend: LLM context window extension via self-attention

Created 1 year ago · 660 stars · Top 51.7% on sourcepulse

Project Summary

This repository provides an implementation of Self-Extend, a method to significantly extend the context window of Large Language Models (LLMs) without requiring any fine-tuning. It targets researchers and practitioners working with LLMs who need to process longer sequences, offering a way to leverage the inherent long-context capabilities of existing models.

How It Works

Self-Extend constructs bi-level attention information: group-level attention, which maps distant tokens onto coarse position groups via floor division of their positions, and neighbor-level attention, which keeps exact relative positions for nearby tokens. Both are computed with the model's existing self-attention mechanism, so no training is required. By structuring attention this way across the input sequence, Self-Extend elicits the LLM's inherent ability to handle contexts longer than its pretraining window.
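
The core of the method is how relative positions are remapped before they reach the positional encoding. The sketch below illustrates that mapping under the grouping described above; the function and variable names are illustrative, not the repository's API.

```python
# Illustrative sketch of Self-Extend's bi-level relative-position mapping.
# Parameter names mirror SelfExtend.apply (group size, neighbor window),
# but this is not the repository's implementation, just the idea in miniature.

def self_extend_relative_position(q_pos: int, k_pos: int,
                                  group_size: int, neighbor_window: int) -> int:
    """Map a (query, key) position pair to the relative position fed to
    the positional encoding (e.g. RoPE) under Self-Extend."""
    distance = q_pos - k_pos  # causal attention, so q_pos >= k_pos
    if distance < neighbor_window:
        # Neighbor-level: nearby tokens keep their exact relative
        # positions, preserving local precision.
        return distance
    # Group-level: distant tokens are bucketed by floor division so the
    # largest relative position stays inside the pretrained range; the
    # shift keeps the two levels contiguous at the window boundary.
    shift = neighbor_window - neighbor_window // group_size
    return q_pos // group_size - k_pos // group_size + shift


# A token 8191 positions away is seen at a much smaller relative position:
print(self_extend_relative_position(8191, 0, group_size=8, neighbor_window=1024))  # -> 1919
```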

Quick Start & Requirements

  • Install: Clone the repository.
  • Dependencies: transformers==4.38.2, flash_attn==2.5.6. A Docker image (hoytjin/selfextend_docker:v0.1) is recommended to avoid environment issues.
  • Usage: Apply the method via SelfExtend.apply(loaded_model, group_size, window_size, enable_flash_attention=False); a usage sketch follows this list.
  • Example: Run python example.py.
  • Documentation: example.py
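
A minimal end-to-end sketch, assuming the cloned repository is on the Python path. The SelfExtend.apply call follows the signature quoted above; the model checkpoint, hyperparameter values, and prompt are illustrative choices.

```python
# Usage sketch: load a supported model with transformers, then patch it
# in place with SelfExtend.apply(). Assumes the cloned LongLM directory
# is on PYTHONPATH; model name and hyperparameter values are examples.
import SelfExtend  # module provided by the cloned repository
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # any supported model family
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")

# Positional arguments: group_size=8, window_size (neighbor window)=1024.
SelfExtend.apply(model, 8, 1024, enable_flash_attention=False)

# The patched model can now attend over inputs longer than its pretrained
# context window, without any fine-tuning.
inputs = tokenizer("A very long document ...", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```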

Highlighted Details

  • Supports Llama, Mistral, Phi-2, Qwen1.5, and Gemma models.
  • Offers a Triton-based FlashSelfExtend implementation for potential performance gains.
  • Showcased in a Google I/O session demonstrating Gemma's long-context abilities.
  • Provides guidance and empirical rules for selecting the group_size and neighbor_window hyperparameters (a rough sizing sketch follows this list).
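
As a rough way to reason about those two knobs: every remapped relative position must stay within the pretrained context window, which bounds the usable extended length at approximately (pretrained_length - neighbor_window) * group_size + neighbor_window. This follows from the position-mapping sketch above and is an approximation, not a rule quoted from the repository.

```python
# Back-of-the-envelope bound on the extended context length, derived from
# the position-mapping sketch above (an approximation, not the project's
# official guidance).
def approx_max_extended_length(pretrained_len: int, group_size: int,
                               neighbor_window: int) -> int:
    return (pretrained_len - neighbor_window) * group_size + neighbor_window

# e.g. a 4k-context model with group_size=8 and neighbor_window=1024
# could in principle cover roughly 25k tokens:
print(approx_max_extended_length(4096, group_size=8, neighbor_window=1024))  # -> 25600
```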

Maintenance & Community

  • The accompanying paper was accepted at ICML 2024.
  • Active development with recent updates for Llama-3 support.
  • A Discord server is available for discussions.

Licensing & Compatibility

  • License: MIT License.
  • Compatibility: Permissive license suitable for commercial use and integration with closed-source applications.

Limitations & Caveats

The effectiveness and optimal hyperparameter selection (group_size, neighbor_window) can depend on the specific model and task, with empirical rules provided as guidance. While FlashAttention is supported, its full functionality for the decoding stage is still under active debugging.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 15 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Jeff Hammerbacher (Cofounder of Cloudera), and 1 more.

yarn by jquesnelle

  • Top 1.0% on sourcepulse · 2k stars
  • Context window extension method for LLMs (research paper, models)
  • Created 2 years ago, updated 1 year ago
  • Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems) and Georgios Konstantopoulos (CTO, General Partner at Paradigm).

LongLoRA by dvlab-research

  • Top 0.1% on sourcepulse · 3k stars
  • LongLoRA: Efficient fine-tuning for long-context LLMs
  • Created 1 year ago, updated 11 months ago