evaporate by HazyResearch

Code and data for a research paper on using LLMs to generate structured views of data lakes

created 2 years ago
489 stars

Top 64.0% on sourcepulse

Project Summary

This repository provides code and datasets for a paper on using language models to generate structured views of heterogeneous data lakes. It is aimed at researchers and engineers working with large, unstructured datasets, and covers both schema generation and data extraction.

How It Works

The system leverages large language models (LLMs) for both schema identification and data extraction. For open information extraction (open IE), it first uses schema_identification.py to propose attributes for the schema. Then, profiler.py iterates through these attributes, generating functions to extract data from documents. The quality of these generated functions is evaluated against direct LLM prompting using evaluate_profiler.py.
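The flow described above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's code: propose_attributes, make_extractor, and build_view are hypothetical names, and simple regular expressions stand in for the LLM calls that schema_identification.py and profiler.py make. It also assumes that documents contain "key: value" lines, which real data lakes generally will not guarantee.

```python
import re
from collections import Counter
from typing import Callable, Dict, List

# Assumption: attributes appear in documents as "key: value" lines.
KEY_LINE = re.compile(r"^([A-Za-z][A-Za-z ]*):", re.MULTILINE)


def propose_attributes(sample_docs: List[str], min_support: float = 0.5) -> List[str]:
    """Stand-in for schema identification: keep keys seen in enough documents."""
    counts: Counter = Counter()
    for doc in sample_docs:
        # Count each key at most once per document.
        counts.update({m.group(1).strip().lower() for m in KEY_LINE.finditer(doc)})
    threshold = min_support * len(sample_docs)
    return sorted(key for key, n in counts.items() if n >= threshold)


def make_extractor(attribute: str) -> Callable[[str], str]:
    """Stand-in for function generation: build one reusable extractor per attribute."""
    pattern = re.compile(
        rf"^{re.escape(attribute)}:[ \t]*(.*)$", re.IGNORECASE | re.MULTILINE
    )

    def extract(doc: str) -> str:
        match = pattern.search(doc)
        return match.group(1).strip() if match else ""

    return extract


def build_view(docs: List[str], attributes: List[str]) -> List[Dict[str, str]]:
    """Apply every extractor to every document, yielding one row per document."""
    extractors = {attr: make_extractor(attr) for attr in attributes}
    return [{attr: fn(doc) for attr, fn in extractors.items()} for doc in docs]


docs = [
    "Name: Aspirin\nDose: 100 mg\nMaker: Acme",
    "Name: Ibuprofen\nDose: 200 mg",
    "Name: Paracetamol\nNotes: none",
]
attributes = propose_attributes(docs)
view = build_view(docs, attributes)
```

The point of generating a function per attribute, rather than prompting a model on every document, is that the function can be reused across the whole collection once it exists.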

Quick Start & Requirements

  • Install:
    • Clone the repository: git clone git@github.com:HazyResearch/evaporate.git
    • Install main package: cd evaporate; pip install -e .
    • Initialize and install submodules: cd metal-evap; git submodule init; git submodule update; pip install -e .
    • Install Manifest: cd manifest; pip install -e .
  • Prerequisites: Python 3.8 (a Conda environment is recommended). LLM API keys (e.g., OpenAI) are required to run inference.
  • Data: Datasets are available on Hugging Face (hazyresearch/evaporate) and expected at /data/evaporate/ (configurable).
  • Running: Execute bash run.sh from the src/ directory for closed and open IE runs.
  • Documentation: run_profiler.py provides a code walkthrough.
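Before launching a run, it can help to confirm that the data directory and an API key are in place. The following is a minimal preflight sketch and is not part of the repository: the EVAPORATE_DATA_DIR variable and the preflight helper are hypothetical, and reading the key from OPENAI_API_KEY is an assumption about how keys are supplied.

```python
import os
from pathlib import Path
from typing import List, Mapping, Optional

# Default location named in the summary above.
DEFAULT_DATA_DIR = "/data/evaporate/"


def resolve_data_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the data directory, honoring a hypothetical override variable."""
    env = os.environ if env is None else env
    return Path(env.get("EVAPORATE_DATA_DIR", DEFAULT_DATA_DIR))


def preflight(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Collect human-readable problems; an empty list means the checks passed."""
    env = os.environ if env is None else env
    problems = []
    data_dir = resolve_data_dir(env)
    if not data_dir.is_dir():
        problems.append(f"data directory not found: {data_dir}")
    # Assumption: the key is provided through this environment variable.
    if not env.get("OPENAI_API_KEY"):
        problems.append("no LLM API key found in OPENAI_API_KEY")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("warning:", problem)
```

Passing the environment as an argument keeps the check easy to exercise without touching the real process environment.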

Highlighted Details

  • Demonstrates LLMs for automated schema generation and data extraction from data lakes.
  • Evaluates generated extraction functions against direct LLM prompting.
  • Supports both closed and open information extraction tasks.
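Comparing generated functions with direct prompting comes down to scoring two sets of extracted values against each other. The sketch below uses token-level F1, a common choice for text extraction. Whether evaluate_profiler.py computes exactly this metric is an assumption, and token_f1 and mean_f1 are hypothetical names.

```python
from collections import Counter
from typing import List


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between two strings, case-insensitive, whitespace-tokenized."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens and not ref_tokens:
        return 1.0  # both empty: treat as agreement
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def mean_f1(function_outputs: List[str], direct_outputs: List[str]) -> float:
    """Average F1 of function-extracted values against directly prompted values."""
    if len(function_outputs) != len(direct_outputs):
        raise ValueError("both lists must cover the same documents")
    if not function_outputs:
        return 0.0
    scores = [token_f1(f, d) for f, d in zip(function_outputs, direct_outputs)]
    return sum(scores) / len(scores)


# Hypothetical values for one attribute across three documents.
from_functions = ["100 mg daily", "200 mg", ""]
from_direct_prompting = ["100 mg", "200 mg", "50 mg"]
score = mean_f1(from_functions, from_direct_prompting)
```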

Maintenance & Community

The project is associated with HazyResearch. Further community or roadmap information is not detailed in the README.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The system requires API keys for LLM providers. The data path defaults to /data/evaporate/ and must be changed for setups that store the data elsewhere.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 2 stars in the last 90 days
