Code and data for a research paper on using LLMs to generate structured views of data lakes
This repository provides code and datasets for a paper on using language models to generate structured views from heterogeneous data lakes. It targets researchers and engineers working with large, unstructured datasets who need automated schema generation and data extraction.
How It Works
The system leverages large language models (LLMs) for both schema identification and data extraction. For open information extraction (open IE), it first uses schema_identification.py to propose attributes for the schema. Then, profiler.py iterates through these attributes, generating functions to extract data from documents. The quality of these generated functions is evaluated against direct LLM prompting using evaluate_profiler.py.
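The three stages described above (propose attributes, generate one extraction function per attribute, compare against direct extraction) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the repository's code: `propose_attributes`, `synthesize_extractor`, `direct_extract`, `build_view`, and `agreement` are hypothetical names, and every LLM call is replaced by a deterministic stub so the example runs offline.

```python
import re
from typing import Callable, Dict, List

# Toy "data lake": a few semi-structured text documents (made up for illustration).
DOCS: List[str] = [
    "name: Alice\nage: 31\ncity: Paris",
    "name: Bob\nage: 45\ncity: Oslo",
]


def propose_attributes(docs: List[str]) -> List[str]:
    """Stand-in for schema identification.

    Assumption: a real system would ask an LLM to propose attribute names.
    Here we just collect the keys of `key: value` lines, in first-seen order.
    """
    seen: List[str] = []
    for doc in docs:
        for key in re.findall(r"^(\w+):", doc, flags=re.MULTILINE):
            if key not in seen:
                seen.append(key)
    return seen


def synthesize_extractor(attribute: str) -> Callable[[str], str]:
    """Stand-in for function generation.

    Assumption: a real system would have an LLM write extraction code.
    Here we build a regex-based function for the given attribute.
    """
    pattern = re.compile(rf"^{re.escape(attribute)}:\s*(.*)$", flags=re.MULTILINE)

    def extract(doc: str) -> str:
        match = pattern.search(doc)
        return match.group(1).strip() if match else ""

    return extract


def direct_extract(doc: str, attribute: str) -> str:
    """Stand-in for direct LLM prompting on one document (the comparison baseline)."""
    for line in doc.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == attribute:
            return value.strip()
    return ""


def build_view(docs: List[str]) -> List[Dict[str, str]]:
    """Apply one reusable extractor per attribute to every document."""
    attributes = propose_attributes(docs)
    extractors = {a: synthesize_extractor(a) for a in attributes}
    return [{a: fn(doc) for a, fn in extractors.items()} for doc in docs]


def agreement(docs: List[str], view: List[Dict[str, str]]) -> float:
    """Fraction of cells where the function output matches direct extraction."""
    total = matches = 0
    for doc, row in zip(docs, view):
        for attribute, value in row.items():
            total += 1
            matches += value == direct_extract(doc, attribute)
    return matches / total if total else 0.0


view = build_view(DOCS)
score = agreement(DOCS, view)
```

The point of the sketch is the cost structure: the expensive step (writing an extractor) happens once per attribute, while applying the extractor to each document is cheap, and the direct per-document path serves only as a reference for measuring quality.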
Quick Start & Requirements
git clone git@github.com:HazyResearch/evaporate.git
cd evaporate; pip install -e .
cd metal-evap; git submodule init; git submodule update; pip install -e .
cd manifest; pip install -e .
Data: available from hazyresearch/evaporate and expected at /data/evaporate/ (configurable).
Run bash run.sh from the src/ directory for closed and open IE runs.
run_profiler.py provides a code walkthrough.
Highlighted Details
Maintenance & Community
The project is associated with HazyResearch. Further community or roadmap information is not detailed in the README.
Licensing & Compatibility
The repository's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The system requires API keys for LLM providers, and the data path defaults to /data/evaporate/, so it must be changed for other setups.
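One common way to handle both caveats is to resolve the data directory and the API key from the environment before starting a run. The sketch below is an assumption for illustration only: the variable names `EVAPORATE_DATA_DIR` and `LLM_API_KEY` and the helper functions are hypothetical and are not options the repository is known to read.

```python
import os
from pathlib import Path
from typing import Mapping

# Default location named in the summary above.
DEFAULT_DATA_DIR = Path("/data/evaporate/")


def resolve_data_dir(env: Mapping[str, str] = os.environ) -> Path:
    """Return the data directory, preferring a (hypothetical) override variable."""
    return Path(env.get("EVAPORATE_DATA_DIR", str(DEFAULT_DATA_DIR)))


def require_api_key(env: Mapping[str, str] = os.environ,
                    name: str = "LLM_API_KEY") -> str:
    """Fail early with a clear message if the (hypothetical) key variable is unset."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable before running.")
    return key
```

Checking these values up front means a missing key or a wrong path surfaces immediately rather than partway through a long extraction run.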