pasa by bytedance

LLM agent for academic paper search

Created 1 year ago

1,473 stars

Top 27.7% on SourcePulse

Project Summary

PaSa is an LLM-powered agent designed for comprehensive academic paper search, targeting researchers and academics. It automates complex scholarly queries by autonomously invoking search tools, reading papers, and selecting relevant references, aiming to deliver accurate and thorough results.

How It Works

PaSa employs a two-agent architecture: a Crawler and a Selector. The Crawler interacts with search tools, expands citations, and manages a paper queue. The Selector evaluates papers in the queue based on user query criteria, assigning relevance scores. This modular design allows for specialized optimization of search and evaluation tasks.
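The division of labor described above can be sketched as a simple queue loop. This is a minimal illustration, not the project's implementation: the Paper class, the run_agent function, the threshold and max_papers parameters, and the three callables (search, lookup, score) are all hypothetical names introduced here. In PaSa, both roles are played by LLMs; plain functions stand in for them below.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Paper:
    paper_id: str
    title: str
    citations: List[str] = field(default_factory=list)  # ids of cited papers


def run_agent(
    query: str,
    search: Callable[[str], List[Paper]],        # hypothetical search tool
    lookup: Callable[[str], Optional[Paper]],    # hypothetical id -> paper resolver
    score: Callable[[str, Paper], float],        # hypothetical Selector stand-in
    threshold: float = 0.5,
    max_papers: int = 50,
) -> Dict[str, float]:
    """Crawl from search results through citations; keep papers scoring >= threshold."""
    queue: deque = deque()
    seen = set()
    # Crawler step 1: seed the paper queue from the search tool, skipping duplicates.
    for paper in search(query):
        if paper.paper_id not in seen:
            seen.add(paper.paper_id)
            queue.append(paper)

    selected: Dict[str, float] = {}
    processed = 0
    while queue and processed < max_papers:
        paper = queue.popleft()
        processed += 1
        # Selector: assign a relevance score against the user query.
        relevance = score(query, paper)
        if relevance >= threshold:
            selected[paper.paper_id] = relevance
        # Crawler step 2: expand citations into the queue.
        for cited_id in paper.citations:
            if cited_id not in seen:
                cited = lookup(cited_id)
                if cited is not None:
                    seen.add(cited_id)
                    queue.append(cited)
    return selected


# Toy corpus standing in for a real search backend.
corpus = {
    "a": Paper("a", "graph neural networks", ["b", "c"]),
    "b": Paper("b", "graph attention", ["a"]),
    "c": Paper("c", "protein folding", []),
}
result = run_agent(
    "graph",
    search=lambda q: [corpus["a"]],
    lookup=corpus.get,
    score=lambda q, p: 1.0 if q in p.title else 0.0,
)
print(result)
```

Keeping the scoring function separate from the queue logic mirrors the modularity the summary mentions: either half can be swapped or tuned without touching the other.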

Quick Start & Requirements

Installation: Clone the transformers and trl repositories, install them in editable mode (pip install -e .), and then install project dependencies (pip install -r requirements.txt).
Prerequisites: A Google Search API key from serper.dev.
Data: Download datasets from pasa-dataset.
Models: Download pre-trained checkpoints for pasa-7b-crawler and pasa-7b-selector.
Execution: Run the agent using python run_paper_agent.py.
Training: Custom training requires modifying the trl and transformers codebases. Detailed SFT and PPO training scripts are provided.
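To make the API-key prerequisite concrete, here is a hedged sketch of a search request to serper.dev using only the Python standard library. The endpoint URL and the X-API-KEY header reflect Serper's public documentation as best understood here and should be treated as assumptions. The function names and the SERPER_API_KEY environment variable are hypothetical, and the repository may wire the key differently.

```python
import json
import os
import urllib.request

# Assumed endpoint; verify against the Serper documentation before use.
SERPER_URL = "https://google.serper.dev/search"


def build_search_request(query: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the query as JSON."""
    payload = json.dumps({"q": query}).encode("utf-8")
    return urllib.request.Request(
        SERPER_URL,
        data=payload,
        headers={"X-API-KEY": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def search(query: str, api_key: str) -> dict:
    """Send the request and return the decoded JSON response."""
    request = build_search_request(query, api_key)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


if __name__ == "__main__":
    # SERPER_API_KEY is a hypothetical variable name chosen for this sketch.
    key = os.environ.get("SERPER_API_KEY")
    if key:
        print(search("retrieval-augmented generation site:arxiv.org", key))
```

Each call counts against the key's quota, so building the request separately from sending it makes the logic easy to test offline.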

Highlighted Details

Outperforms baselines like Google, Google Scholar, ChatGPT, and GPT-4o on both synthetic (AutoScholarQuery) and real-world (RealScholarQuery) datasets.
PaSa-7B exceeds the strongest Google-based baseline, Google with GPT-4o, by 37.78% in recall@20 on RealScholarQuery.
Ensembling the Crawler (running it multiple times) further boosts performance.
Supports custom training via SFT and PPO, with provided scripts and dataset examples.
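The recall figures and the ensembling effect above can be illustrated with a short sketch. It uses the standard definition of recall@k (the fraction of relevant papers found in the top k results) and models ensembling as an order-preserving union of several runs. Both are assumptions for illustration: the project's evaluation scripts may differ in detail, and the paper ids below are made up.

```python
from typing import Iterable, List, Sequence


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Iterable[str], k: int) -> float:
    """Fraction of relevant ids that appear among the first k ranked ids."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = relevant & set(ranked_ids[:k])
    return len(hits) / len(relevant)


def ensemble(runs: Iterable[Sequence[str]]) -> List[str]:
    """Merge several runs into one list, keeping the first occurrence of each id."""
    merged: List[str] = []
    seen = set()
    for run in runs:
        for paper_id in run:
            if paper_id not in seen:
                seen.add(paper_id)
                merged.append(paper_id)
    return merged


# Made-up ids: two crawler runs that each find a different part of the relevant set.
relevant = {"p1", "p4", "p5", "p9"}
run_a = ["p1", "p2", "p3"]
run_b = ["p4", "p1", "p5"]

single = recall_at_k(run_a, relevant, k=20)
combined = recall_at_k(ensemble([run_a, run_b]), relevant, k=20)
print(single, combined)  # 0.25 0.75
```

Because the runs surface different papers, their union covers more of the relevant set than either run alone, which is the intuition behind running the Crawler several times.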

Maintenance & Community

The project is associated with ByteDance. Citation details are available in BibTeX format. Links to community channels (Discord/Slack) or roadmaps are not explicitly provided in the README.

Licensing & Compatibility

The project's license is not explicitly stated in the README. The code modifications to trl and transformers suggest potential licensing implications from those libraries. Compatibility for commercial use or closed-source linking is not detailed.

Limitations & Caveats

The system requires a Google Search API key, which may incur costs. Training custom agents involves significant computational resources and requires familiarity with accelerate and deepspeed. The project's reliance on external APIs and specific library versions might impact long-term stability.

Health Check

Last Commit

7 months ago

Responsiveness

1 week

Star History

21 stars in the last 30 days