LLMxMapReduce by thunlp

Framework for LLM long-sequence processing via MapReduce-inspired divide-and-conquer

Created 1 year ago
807 stars

Top 43.8% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

LLMxMapReduce is a framework for processing and generating long sequences using large language models (LLMs), inspired by the MapReduce paradigm. It aims to address the challenge of integrating and analyzing information from extensive inputs, enabling LLMs to handle long-to-long generation tasks more effectively. The target audience includes researchers and developers working with LLMs on applications requiring long-form content generation.
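To make the divide-and-conquer idea concrete, here is a minimal, framework-agnostic sketch of the map-reduce pattern over a long input. The call_llm helper, the prompts, and the chunking strategy are hypothetical stand-ins for illustration, not this repository's API.

    # Minimal map-reduce over a long document with an LLM.
    # `call_llm` is a hypothetical stand-in for your chat-completion client.

    def call_llm(prompt: str) -> str:
        raise NotImplementedError("plug in your LLM client here")

    def split_into_chunks(text: str, chunk_chars: int = 8000) -> list[str]:
        # Naive fixed-size splitting; real pipelines split on sentence or
        # section boundaries instead.
        return [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]

    def map_stage(chunks: list[str], question: str) -> list[str]:
        # Map: extract the locally relevant information from each chunk independently.
        return [call_llm(f"Context:\n{c}\n\nExtract everything relevant to: {question}")
                for c in chunks]

    def reduce_stage(partials: list[str], question: str) -> str:
        # Reduce: integrate the per-chunk extractions into one global answer.
        merged = "\n\n".join(partials)
        return call_llm(f"Partial notes:\n{merged}\n\nSynthesize a single answer to: {question}")

    def map_reduce(document: str, question: str) -> str:
        return reduce_stage(map_stage(split_into_chunks(document), question), question)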

How It Works

LLMxMapReduce-V2 employs an entropy-driven convolutional test-time scaling mechanism. This approach, drawing from convolutional neural networks, uses stacked convolutional scaling layers to progressively integrate local features into higher-level global representations. This iterative refinement allows LLMs to better process and synthesize information from extremely large input volumes, improving coherence and informativeness in generated long-form articles.
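As a rough illustration of that idea (not the paper's actual algorithm), the sketch below stacks "convolutional" merge layers over a sequence of local digests: each layer slides a window across neighboring digests and fuses each window into a higher-level digest until one global representation remains. The word-frequency entropy score is a crude proxy for the entropy signal the authors describe; call_llm, merge_window, and the refinement threshold are all assumptions.

    import math
    from collections import Counter

    def call_llm(prompt: str) -> str:
        raise NotImplementedError("plug in your LLM client here")

    def entropy(text: str) -> float:
        # Crude proxy for an information-density signal:
        # Shannon entropy of the word-frequency distribution.
        counts = Counter(text.split())
        total = sum(counts.values())
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    def merge_window(window: list[str]) -> str:
        # One "convolution": integrate a window of neighboring local digests
        # into a single higher-level digest.
        return call_llm("Integrate these partial digests into one coherent digest:\n\n"
                        + "\n\n".join(window))

    def convolutional_scaling(local_digests: list[str],
                              kernel: int = 3, stride: int = 2,
                              refine_threshold: float = 9.0) -> str:
        # Stack scaling layers until one global representation remains.
        layer = local_digests
        while len(layer) > 1:
            next_layer = []
            for i in range(0, len(layer), stride):
                digest = merge_window(layer[i:i + kernel])
                # Illustrative stand-in for entropy-driven test-time scaling:
                # spend an extra refinement pass on information-dense windows.
                # The threshold is arbitrary, chosen only for this sketch.
                if entropy(digest) > refine_threshold:
                    digest = call_llm("Condense and sharpen this digest:\n\n" + digest)
                next_layer.append(digest)
            layer = next_layer
        return layer[0]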

Quick Start & Requirements

  • Installation: Clone the repository, create and activate a conda environment (conda create -n llm_mr_v2 python=3.11, conda activate llm_mr_v2), install dependencies (pip install -r requirements.txt), and install Playwright browsers (python -m playwright install --with-deps chromium).
  • Prerequisites: Python 3.11, conda, Playwright, NLTK (nltk.download('punkt_tab')).
  • Environment Variables: OPENAI_API_KEY, OPENAI_API_BASE, GOOGLE_API_KEY, SERP_API_KEY (optional), PROMPT_LANGUAGE (optional, defaults to English).
  • Model Configuration: model_config.json specifies API type (OpenAI/Google) and model names.
  • Running the Pipeline: bash scripts/pipeline_start.sh TOPIC output_file_path.jsonl
  • Data Format: Input is a JSONL file where each line has a title and a list of papers (each paper containing a title, an optional abstract, and txt); see the sketch after this list.
  • Links: LLMxMapReduce-V2 Paper, LLMxMapReduce-V1 Paper
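For reference, here is one way to assemble a line of the input file described above. The field names (title, papers, and each paper's title, optional abstract, and txt) come from the Data Format bullet; the values and the file name are illustrative, and any nesting beyond those names is an assumption.

    import json

    # One topic per JSONL line. Field names follow the README's description;
    # the concrete values here are placeholders for illustration.
    record = {
        "title": "Example survey topic",
        "papers": [
            {
                "title": "An example source paper",
                "abstract": "Optional abstract text.",  # abstract may be omitted
                "txt": "Full text of the paper goes here...",
            },
        ],
    }

    with open("input.jsonl", "w", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")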

Highlighted Details

  • LLMxMapReduce-V2 powers the online SurveyGO writing system.
  • V1 enabled MiniCPM3-4B to outperform 70B models in long-context evaluations.
  • Supports both OpenAI and OpenAI-compatible APIs (e.g., vLLM).
  • Evaluation uses the SurveyEval dataset and requires sufficient API balance.

Maintenance & Community

Developed collaboratively by AI9STARS, OpenBMB, and THUNLP.

Licensing & Compatibility

The repository does not explicitly state a license in the README.

Limitations & Caveats

The project strongly recommends using Gemini Flash models, warning of potential unknown errors with other models. It is not recommended for use with locally deployed models due to high API consumption and concurrency requirements.

Health Check

  • Last Commit: 6 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 3
  • Star History: 21 stars in the last 30 days

Explore Similar Projects

Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI).

dots.llm1 by rednote-hilab

0.2% · 462 stars
MoE model for research
Created 4 months ago
Updated 4 weeks ago
Starred by Wing Lian (Founder of Axolotl AI), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 2 more.

recurrent-pretraining by seal-rg

0% · 827 stars
Pretraining code for depth-recurrent language model research
Created 7 months ago
Updated 1 week ago
Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 8 more.

EAGLE by SafeAILab

10.6% · 2k stars
Speculative decoding research paper for faster LLM inference
Created 1 year ago
Updated 1 week ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Pawel Garbacki (Cofounder of Fireworks AI), and 4 more.

LongLoRA by dvlab-research

0.1% · 3k stars
LongLoRA: Efficient fine-tuning for long-context LLMs
Created 2 years ago
Updated 1 year ago