pasa by bytedance

LLM agent for academic paper search

created 7 months ago
1,264 stars

Top 32.0% on sourcepulse

Project Summary

PaSa is an LLM-powered agent designed for comprehensive academic paper search, targeting researchers and academics. It automates complex scholarly queries by autonomously invoking search tools, reading papers, and selecting relevant references, aiming to deliver accurate and thorough results.

How It Works

PaSa employs a two-agent architecture: a Crawler and a Selector. The Crawler interacts with search tools, expands citations, and manages a paper queue. The Selector evaluates papers in the queue based on user query criteria, assigning relevance scores. This modular design allows for specialized optimization of search and evaluation tasks.
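
The division of labor described above can be illustrated with a minimal sketch. It is not PaSa's implementation: in the real system both agents are LLMs, whereas here plain functions stand in for them. The names `Paper`, `run_paper_search`, `search`, `lookup`, `score`, `threshold`, and `max_papers` are hypothetical, and expanding the citations of every paper (rather than only relevant ones) is an assumption made for simplicity.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Paper:
    title: str
    abstract: str = ""
    citations: list = field(default_factory=list)  # titles of cited papers


def run_paper_search(
    query: str,
    search: Callable[[str], list],             # stand-in for the search tool
    lookup: Callable[[str], Optional[Paper]],  # stand-in for reading a paper
    score: Callable[[str, Paper], float],      # stand-in for the Selector
    threshold: float = 0.5,                    # hypothetical relevance cut-off
    max_papers: int = 50,                      # hypothetical crawl budget
) -> list:
    """Crawler loop: fill a queue from search, expand citations, score each paper."""
    queue = deque(search(query))
    seen = set()
    selected = []
    while queue and len(seen) < max_papers:
        title = queue.popleft()
        if title in seen:
            continue
        seen.add(title)
        paper = lookup(title)
        if paper is None:
            continue
        # Selector step: assign a relevance score for this query.
        relevance = score(query, paper)
        if relevance >= threshold:
            selected.append((paper.title, relevance))
        # Crawler step: citation expansion feeds the queue.
        queue.extend(c for c in paper.citations if c not in seen)
    return sorted(selected, key=lambda item: item[1], reverse=True)


# Toy corpus and toy scoring, purely for illustration.
corpus = {
    "Dense retrieval survey": Paper(
        "Dense retrieval survey",
        "retrieval methods",
        ["BM25 revisited", "Image segmentation"],
    ),
    "BM25 revisited": Paper("BM25 revisited", "sparse retrieval baseline"),
    "Image segmentation": Paper("Image segmentation", "vision models"),
}


def toy_search(query: str) -> list:
    return ["Dense retrieval survey"]


def toy_score(query: str, paper: Paper) -> float:
    return 1.0 if query in paper.abstract else 0.0


results = run_paper_search("retrieval", toy_search, corpus.get, toy_score)
```

In this toy run the search tool returns a single seed paper; the second relevant paper is reached only through citation expansion, while the off-topic citation is crawled but scored below the threshold.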

Quick Start & Requirements

  • Installation: Clone the transformers and trl repositories, install them in editable mode (pip install -e .), and then install project dependencies (pip install -r requirements.txt).
  • Prerequisites: Requires a Google Search API key from serper.dev.
  • Data: Download datasets from pasa-dataset.
  • Models: Download pre-trained checkpoints for pasa-7b-crawler and pasa-7b-selector.
  • Execution: Run the agent using python run_paper_agent.py.
  • Training: Custom training requires modifying the trl and transformers codebases. Detailed SFT and PPO training scripts are provided.
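
Before running the agent, it can help to confirm that the serper.dev key works. The sketch below builds a search request using only the standard library. The endpoint and the `X-API-KEY` header follow Serper's public documentation as best recalled and should be checked against it. The environment variable name `SERPER_API_KEY` and both function names are hypothetical, and the way PaSa itself reads the key is not described in this summary.

```python
import json
import os
import urllib.request

# Assumed endpoint, per Serper's public documentation; verify before relying on it.
SERPER_URL = "https://google.serper.dev/search"


def build_serper_request(query: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for a Google search via Serper."""
    payload = json.dumps({"q": query}).encode("utf-8")
    return urllib.request.Request(
        SERPER_URL,
        data=payload,
        headers={"X-API-KEY": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def search(query: str) -> dict:
    """Send the request; SERPER_API_KEY is a hypothetical variable name."""
    api_key = os.environ["SERPER_API_KEY"]
    request = build_serper_request(query, api_key)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Calling `search("academic paper search agents")` with a valid key should return a JSON object; each call may count against the account's quota.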

Highlighted Details

  • Outperforms baselines like Google, Google Scholar, ChatGPT, and GPT-4o on both synthetic (AutoScholarQuery) and real-world (RealScholarQuery) datasets.
  • PaSa-7B improves recall substantially: on RealScholarQuery it surpasses the strongest Google-based baseline, Google with GPT-4o, by 37.78% in recall@20.
  • Ensembling the Crawler (running it multiple times) further boosts performance.
  • Supports custom training via SFT and PPO, with provided scripts and dataset examples.
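
To make the recall@k figures and the ensembling claim concrete, here is a small sketch. `recall_at_k` uses the standard definition (the fraction of relevant papers found in the top k results). `ensemble_runs` assumes that ensembling means taking the union of several Crawler runs in first-seen order, which this summary does not specify. Both function names and the paper IDs are hypothetical, and PaSa's actual evaluation protocol may differ.

```python
def recall_at_k(ranked: list, relevant: set, k: int) -> float:
    """Fraction of relevant items that appear in the top-k of a ranked list."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def ensemble_runs(runs: list) -> list:
    """Merge several result lists, keeping the first occurrence of each item.

    Assumption: a plain union; the real merging strategy is not stated here.
    """
    merged = []
    seen = set()
    for run in runs:
        for paper_id in run:
            if paper_id not in seen:
                seen.add(paper_id)
                merged.append(paper_id)
    return merged


# Made-up paper IDs, purely for illustration.
relevant = {"p1", "p2", "p3", "p4"}
run_a = ["p1", "x", "p2"]
run_b = ["p3", "p1", "y"]

single = recall_at_k(run_a, relevant, k=20)                            # 2 of 4 found
combined = recall_at_k(ensemble_runs([run_a, run_b]), relevant, k=20)  # 3 of 4 found
```

In this toy case the second run contributes one relevant paper the first run missed, so recall rises from 0.5 to 0.75. A union can only add candidates, which is why repeated Crawler runs can raise recall.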

Maintenance & Community

The project is associated with ByteDance. Citation details are available in BibTeX format. Links to community channels (Discord/Slack) or roadmaps are not explicitly provided in the README.

Licensing & Compatibility

The README does not state a license. Because setup and training rely on modified copies of trl and transformers, the licenses of those libraries may also apply. Suitability for commercial use or closed-source linking is not addressed.

Limitations & Caveats

The system requires a Google Search API key from serper.dev, which may incur costs. Training custom agents demands significant computational resources and familiarity with accelerate and deepspeed. The project's reliance on external APIs and on modified versions of specific libraries may affect long-term stability.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 128 stars in the last 90 days
