LLMxMapReduce  by thunlp

Framework for LLM long-sequence processing via MapReduce-inspired divide-and-conquer

created 10 months ago
776 stars

Top 45.9% on sourcepulse

Project Summary

LLMxMapReduce is a framework for processing and generating long sequences with large language models (LLMs), inspired by the MapReduce paradigm. It addresses the difficulty of integrating and analyzing information spread across very long inputs, so that LLMs can handle long-to-long generation tasks more effectively. It is aimed at researchers and developers building LLM applications that require long-form content generation.

How It Works

LLMxMapReduce-V2 uses an entropy-driven convolutional test-time scaling mechanism. Borrowing from convolutional neural networks, it stacks convolutional scaling layers so that local features are progressively integrated into higher-level, global representations. This iterative refinement helps the LLM process and synthesize information from very large inputs, improving the coherence and informativeness of the generated long-form articles.
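
The stacking idea can be pictured with a small toy sketch. This is not the project's implementation: the names conv_layer, stack_layers, and toy_merge, the window width, and the stride are all hypothetical, the merge step is a plain string join standing in for an LLM call, and the entropy-driven part of the method is not modeled at all. It only shows how overlapping local windows can be merged layer by layer until a single global result remains.

```python
from typing import Callable, List, Tuple

Merge = Callable[[List[str]], str]


def conv_layer(items: List[str], merge: Merge, width: int = 3, stride: int = 2) -> List[str]:
    """Slide a window over `items` and merge each window into one summary.

    Windows overlap when stride < width, so neighbouring outputs share context.
    """
    if not items:
        raise ValueError("items must not be empty")
    out = []
    start = 0
    while True:
        out.append(merge(items[start:start + width]))
        if start + width >= len(items):  # this window reached the end of the input
            break
        start += stride
    return out


def stack_layers(items: List[str], merge: Merge, width: int = 3, stride: int = 2) -> Tuple[str, int]:
    """Apply conv_layer repeatedly until one global item is left.

    Returns the final item and the number of layers that were applied.
    """
    if not items:
        raise ValueError("items must not be empty")
    layers = 0
    while len(items) > 1:
        items = conv_layer(items, merge, width, stride)
        layers += 1
    return items[0], layers


def toy_merge(window: List[str]) -> str:
    """Hypothetical stand-in for an LLM call that fuses several local digests."""
    return "(" + "+".join(window) + ")"


if __name__ == "__main__":
    print(stack_layers(["a", "b", "c", "d", "e"], toy_merge))
```

In a real system the merge function would be a model call and the number of layers would depend on the input size; here each layer simply shrinks the list until one item is left.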

Quick Start & Requirements

  • Installation: Clone the repository, create and activate a conda environment (conda create -n llm_mr_v2 python=3.11, conda activate llm_mr_v2), install dependencies (pip install -r requirements.txt), and install Playwright browsers (python -m playwright install --with-deps chromium).
  • Prerequisites: Python 3.11, conda, Playwright, and NLTK with the punkt_tab tokenizer data (nltk.download('punkt_tab')).
  • Environment Variables: OPENAI_API_KEY, OPENAI_API_BASE, GOOGLE_API_KEY, SERP_API_KEY (optional), PROMPT_LANGUAGE (optional, defaults to English).
  • Model Configuration: model_config.json specifies API type (OpenAI/Google) and model names.
  • Running Pipeline: bash scripts/pipeline_start.sh TOPIC output_file_path.jsonl
  • Data Format: Input data should be a JSONL file with title and papers (each containing title, optional abstract, and txt).
  • Links: LLMxMapReduce-V2 Paper, LLMxMapReduce-V1 Paper
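
As a rough illustration of the data format described in the list above, the following sketch builds one input record and serializes it as a JSONL line. The field names (title, papers, abstract, txt) come from the description; the helper names make_record and write_jsonl, the sample values, and the validation rules are assumptions made for illustration and are not part of the repository.

```python
import json
from typing import Dict, Iterable, List


def make_record(title: str, papers: List[Dict[str, str]]) -> Dict[str, object]:
    """Build one input record: a topic title plus a list of papers.

    Each paper needs `title` and `txt`; `abstract` is optional.
    """
    for index, paper in enumerate(papers):
        missing = {"title", "txt"} - paper.keys()
        if missing:
            raise ValueError(f"paper {index} is missing fields: {sorted(missing)}")
    return {"title": title, "papers": papers}


def write_jsonl(records: Iterable[Dict[str, object]], path: str) -> None:
    """Write one JSON object per line, which is what JSONL means."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    # Sample values are placeholders, not real data.
    record = make_record(
        "Example survey topic",
        [
            {"title": "Paper A", "abstract": "Short abstract.", "txt": "Full text of paper A."},
            {"title": "Paper B", "txt": "Full text of paper B."},
        ],
    )
    print(json.dumps(record, ensure_ascii=False))
```

Checking required fields before writing catches malformed entries early, before any API calls are spent on them.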

Highlighted Details

  • LLMxMapReduce-V2 powers the online SurveyGO writing system.
  • V1 enabled MiniCPM3-4B to outperform 70B models in long-context evaluations.
  • Supports both OpenAI and OpenAI-compatible APIs (e.g., vLLM).
  • Evaluation uses the SurveyEval dataset and requires sufficient API balance.

Maintenance & Community

Developed collaboratively by AI9STARS, OpenBMB, and THUNLP.

Licensing & Compatibility

The repository does not explicitly state a license in the README.

Limitations & Caveats

The project strongly recommends Gemini Flash models and warns that other models may trigger unknown errors. It also advises against locally deployed models because the pipeline consumes a large amount of API calls and requires high concurrency.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 5
  • Star history: 213 stars in the last 90 days
