InfLLM by thunlp

Research paper code for long-sequence LLM processing via training-free memory

created 1 year ago
373 stars

Top 77.1% on sourcepulse

View on GitHub
Project Summary

InfLLM enables Large Language Models (LLMs) to process extremely long sequences without retraining, addressing the limitations of standard LLMs on inputs exceeding their training context length. This training-free memory-based method is designed for researchers and developers working with LLM-driven agents or applications requiring analysis of lengthy streaming data.

How It Works

InfLLM stores distant context in memory units and uses an efficient lookup mechanism to retrieve the units relevant to the current tokens for attention computation. This lets LLMs capture long-distance dependencies, overcoming the limitations of methods that simply discard distant tokens. The system supports configurable memory-unit retrieval strategies (e.g., LRU, FIFO) and can optionally leverage FAISS for faster retrieval.
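The lookup can be pictured as a block-level key/value memory: distant keys and values are packed into fixed-size units, each unit keeps a handful of representative keys, and the current query scores units by those representatives so that only the top-k units join the local window for attention. The sketch below only illustrates that idea; the class name, the norm-based representative selection, and the default sizes are assumptions for clarity and do not reproduce the repository's implementation.

    # Minimal sketch of InfLLM-style block memory (illustrative, not the repo's code).
    import torch

    class BlockMemory:
        def __init__(self, block_size=128, n_repr=4, topk=4):
            self.block_size, self.n_repr, self.topk = block_size, n_repr, topk
            self.units = []  # each unit: (representative keys, keys, values)

        def append_block(self, keys, values):
            # keys/values: (block_size, d). Choose representative keys; the paper
            # scores tokens by the attention they receive, so the norm heuristic
            # here is only a simple stand-in.
            idx = keys.norm(dim=-1).topk(min(self.n_repr, keys.size(0))).indices
            self.units.append((keys[idx], keys, values))

        def lookup(self, query):
            # query: (d,). Return keys/values of the top-k most relevant units,
            # ready to be concatenated with the local window before attention.
            if not self.units:
                return torch.empty(0, query.size(0)), torch.empty(0, query.size(0))
            unit_scores = torch.stack([(r @ query).max() for r, _, _ in self.units])
            chosen = unit_scores.topk(min(self.topk, len(self.units))).indices.tolist()
            keys = torch.cat([self.units[i][1] for i in chosen])
            values = torch.cat([self.units[i][2] for i in chosen])
            return keys, values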

Quick Start & Requirements

  • Install: pip install -r requirements.txt (from the project root)
  • Prerequisites: PyTorch >= 1.13.1, Transformers >= 4.37.2, and Flash-Attention; FAISS is optional.
  • Usage: Evaluate with bash scripts/[infinitebench,longbench].sh, or run a chatbot with python -m inf_llm.chat --model-path <model_path> --inf-llm-config-path <config_path.yaml> (an illustrative config sketch follows this list).
  • Resources: Requires significant GPU memory for long sequences; specific requirements depend on model size and sequence length.
  • Links: Paper, Code
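The YAML passed via --inf-llm-config-path controls the method's hyperparameters described in the paper: initial tokens, local window size, memory-unit size, representative tokens per unit, number of retrieved units, and GPU cache size. The names and values below are an illustrative sketch only; consult the YAML files shipped under config/ in the repository for the actual schema.

    # Illustrative InfLLM hyperparameters, expressed as a Python dict.
    # Field names and defaults are assumptions for explanation; the real
    # schema lives in the repository's config/*.yaml files.
    example_inf_llm_config = {
        "n_init": 128,           # initial ("sink") tokens always kept in context
        "n_local": 4096,         # size of the sliding local attention window
        "block_size": 128,       # tokens grouped into one memory unit
        "repr_topk": 4,          # representative tokens scored per unit
        "topk": 16,              # memory units retrieved for each lookup
        "max_cached_block": 32,  # units kept on GPU before eviction/offload
    }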

Highlighted Details

  • Outperforms baselines that continually train LLMs on long sequences, despite requiring no additional training.
  • Effectively captures long-distance dependencies even when sequence lengths scale to 1,024K tokens.
  • Supports multiple LLM architectures and conversation types, with recent additions for LLaMA 3.
  • Offers configurable memory-management strategies (LRU, FIFO, LRU-S) and retrieval mechanisms; a minimal cache-eviction sketch follows this list.
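The cache-management strategies decide which memory units stay resident on the GPU once the cache fills up. Below is a minimal, generic LRU sketch to make the idea concrete; the class and helper names are hypothetical, and the repository's LRU-S scoring is not reproduced here.

    # Minimal LRU eviction policy for GPU-resident memory units (illustrative).
    # FIFO would evict by insertion order instead; LRU-S (as named in the repo)
    # additionally weights units by usage scores.
    from collections import OrderedDict

    class LRUUnitCache:
        def __init__(self, max_units):
            self.max_units = max_units
            self.resident = OrderedDict()  # unit_id -> tensors kept on GPU

        def touch(self, unit_id, load_fn, offload_fn):
            # load_fn(unit_id) fetches a unit to GPU; offload_fn(unit_id, tensors)
            # moves it back to CPU. Both are placeholders for the real transfers.
            if unit_id in self.resident:
                self.resident.move_to_end(unit_id)       # mark as recently used
                return self.resident[unit_id]
            if len(self.resident) >= self.max_units:
                victim, tensors = self.resident.popitem(last=False)
                offload_fn(victim, tensors)              # evict least-recently-used unit
            self.resident[unit_id] = load_fn(unit_id)
            return self.resident[unit_id]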

Maintenance & Community

  • Initial code release on March 3, 2024, with subsequent refactors for speed and memory efficiency.
  • Recent updates added FAISS-based top-k retrieval and support for LLaMA 3.
  • No explicit community links (Discord/Slack) are provided in the README.

Licensing & Compatibility

  • The README does not explicitly state a license. The code is released for research purposes.
  • Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

  • The perhead attention option is noted as very time-consuming and intended for research use only.
  • FAISS integration increases inference time.
  • The async_global_stream option may not be compatible with all configurations.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1

Star History

  • 21 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Zhuohan Li (Author of vLLM), and 1 more.

Consistency_LLM by hao-ai-lab

0%
397
Parallel decoder for efficient LLM inference
created 1 year ago
updated 8 months ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Ying Sheng (Author of SGLang), and 1 more.

LookaheadDecoding by hao-ai-lab

0.1%
1k
Parallel decoding algorithm for faster LLM inference
created 1 year ago
updated 4 months ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems) and Georgios Konstantopoulos (CTO, General Partner at Paradigm).

LongLoRA by dvlab-research

0.1%
3k
LongLoRA: Efficient fine-tuning for long-context LLMs
created 1 year ago
updated 11 months ago