POPPER  by snap-stanford

Automated validation of free-form hypotheses

Created 1 year ago
250 stars

Top 100.0% on SourcePulse

Summary

POPPER is an agentic framework for automated validation of free-form hypotheses, particularly those generated by LLMs. It addresses the impracticality of manually validating voluminous, abstract, or potentially hallucinated hypotheses by using LLM agents to design and execute falsification experiments. This offers researchers a scalable, rigorous, and significantly faster method for hypothesis validation across scientific domains, ensuring strict Type-I error control.

How It Works

POPPER leverages Karl Popper's principle of falsification. LLM agents design and execute experiments to disprove hypotheses by targeting their measurable implications. A novel sequential testing framework ensures strict Type-I error control while actively gathering evidence from diverse data sources, enabling robust validation of complex, abstract hypotheses.
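The sequential testing idea can be illustrated with a generic e-value sketch. This is an illustration of the statistical principle only, not POPPER's actual implementation; the function name and the exact stopping rule are assumptions. Multiplying independent e-values forms a test supermartingale, and by Ville's inequality, rejecting once the running product reaches 1/alpha keeps the Type-I error at or below alpha at any stopping time:

```python
# Generic sequential e-value test (illustrative sketch, not POPPER's code).
# Each falsification experiment returns an e-value: >1 is evidence against
# the null hypothesis that the tested implication is absent.

def sequential_e_test(e_values, alpha=0.1):
    """Return (reject, experiments_used) for a stream of e-values.

    Rejects as soon as the running product of e-values reaches 1/alpha,
    which bounds the Type-I error by alpha (Ville's inequality).
    """
    product = 1.0
    for i, e in enumerate(e_values, start=1):
        product *= e  # accumulate evidence across experiments
        if product >= 1.0 / alpha:
            return True, i  # enough evidence: hypothesis falsified
    return False, len(e_values)


# Three experiments, each mildly contradicting the hypothesis' implication:
reject, n_used = sequential_e_test([2.0, 3.0, 2.5], alpha=0.1)
```

The appeal of this style of test is that experiments can be run one at a time and stopped early without inflating the false-positive rate, which is what lets an agent gather evidence adaptively.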

Quick Start & Requirements

Installation is recommended within a Python virtual environment (e.g., conda create -n popper_env python=3.10). Install via pip (pip install popper_agent) or from source. Prerequisites include setting OpenAI/Anthropic API keys as environment variables. POPPER supports inference with locally served LLMs (vLLM, SGLang, llama.cpp) via OpenAI-compatible APIs. Datasets are auto-downloaded or specified via data_path. A paper (arXiv:2502.09858) and a demo are referenced by the project but not linked here.
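Collected into a script, the setup steps read roughly as follows. The conda and pip commands come from the summary above; the API-key variable names are the standard OpenAI/Anthropic ones, and the key values are placeholders:

```shell
# Create and activate an isolated environment (as recommended)
conda create -n popper_env python=3.10
conda activate popper_env

# Install from PyPI (or install from source instead)
pip install popper_agent

# Cloud LLM access: set whichever key applies (placeholder values)
export OPENAI_API_KEY="your-openai-key"
export ANTHROPIC_API_KEY="your-anthropic-key"
```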

Highlighted Details

  • Achieved performance comparable to human scientists in biological hypothesis validation while reducing validation time roughly 10-fold.
  • Offers robust Type-I error control and high statistical power across biology, economics, and sociology.
  • Scalable architecture for handling large hypothesis/data volumes.
  • Includes a Gradio UI for interactive validation (agent.launch_UI()).

Maintenance & Community

Hosted by snap-stanford. Contact Kexin Huang (kexinh@cs.stanford.edu) or raise GitHub issues. No specific community channels (Discord/Slack) are listed.

Licensing & Compatibility

The specific open-source license is not explicitly stated, potentially impacting commercial use or integration. The framework is compatible with commercial LLM APIs and locally hosted models.

Limitations & Caveats

The absence of a declared license is a significant caveat for adoption. The system requires API keys for cloud LLMs or local LLM server setup. The README advises running benchmarks in containerized environments due to the agent's filesystem access capabilities.

Health Check

  • Last Commit: 10 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 5 stars in the last 30 days

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Travis Fischer (founder of Agentic).

Explore Similar Projects

  • long-form-factuality by google-deepmind — Benchmark for long-form factuality in LLMs. Top 0.5%, 670 stars. Created 1 year ago; updated 1 month ago.