Hyper-Extract by yifanfeng97

LLM-driven framework for transforming text into structured knowledge

Created 6 months ago

3,022 stars

Top 15.3% on SourcePulse

Project Summary

Hyper-Extract uses LLMs to transform unstructured text into structured knowledge, letting users generate Knowledge Abstracts from documents. Output formats range from simple collections and Pydantic models to Knowledge Graphs, Hypergraphs, and Spatio-Temporal Graphs. The framework targets engineers, researchers, and power users who need to extract and understand information from diverse text sources, under a "stop reading, start understanding" paradigm.

How It Works

Hyper-Extract employs a three-layer architecture: Auto-Types define structured output formats (e.g., AutoGraph, AutoHypergraph, AutoSpatioTemporalGraph), Methods provide extraction algorithms (including RAG-based approaches like GraphRAG and Hyper-RAG), and Templates offer domain-specific configurations. This design allows for declarative, zero-code extraction via YAML templates, supporting over 80 presets across six domains. The framework facilitates incremental knowledge evolution, allowing continuous updates as new documents are processed.
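
To make the three-layer split concrete, here is a minimal sketch in plain Python. It does not use the hyperextract package: SimpleGraph, pipe_method, Template, and run are hypothetical names invented for illustration, and a trivial line parser stands in for the LLM call that a real method would make.

```python
from dataclasses import dataclass, field
from typing import Callable


# Layer 1 (hypothetical auto-type): the shape of the structured output.
@dataclass
class SimpleGraph:
    nodes: set[str] = field(default_factory=set)
    # Each edge is a (source, relation, target) triple.
    edges: set[tuple[str, str, str]] = field(default_factory=set)


# Layer 2 (hypothetical method): an algorithm that fills the type from text.
# A real method would prompt an LLM; this stand-in reads
# "source | relation | target" lines and skips everything else.
def pipe_method(text: str) -> SimpleGraph:
    graph = SimpleGraph()
    for line in text.splitlines():
        parts = [part.strip() for part in line.split("|")]
        if len(parts) != 3 or not all(parts):
            continue
        source, relation, target = parts
        graph.nodes.update((source, target))
        graph.edges.add((source, relation, target))
    return graph


# Registry that lets a declarative template refer to a method by name.
METHODS: dict[str, Callable[[str], SimpleGraph]] = {"pipe": pipe_method}


# Layer 3 (hypothetical template): configuration only, no extraction logic.
@dataclass(frozen=True)
class Template:
    name: str
    domain: str
    method: str


def run(template: Template, text: str) -> SimpleGraph:
    """Look up the method named by the template and apply it to the text."""
    return METHODS[template.method](text)


template = Template(name="org_chart", domain="general", method="pipe")
graph = run(template, "Ada | manages | Grace\nnot a triple\nGrace | mentors | Linus")
print(sorted(graph.nodes))
```

Because the template carries only names and settings, the same data could be loaded from a YAML file, which is the idea behind zero-code extraction.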

Quick Start & Requirements

Primary Install/Run:
- CLI Users: uv tool install hyperextract
- Python Developers: uv pip install hyperextract
- Local Development: Clone https://github.com/yifanfeng97/hyper-extract.git, cd hyper-extract, then uv sync.
Prerequisites: Python 3.11+, OpenAI API Key. Defaults to gpt-4o-mini and text-embedding-3-small.
Links:
- Docs: https://yifanfeng97.github.io/Hyper-Extract/latest/
- Examples: examples/en/ directory within the repository.
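
The prerequisites above can be checked before a first run with a short script such as the one below. This is a sketch, not part of the project: check_prerequisites is a hypothetical helper, and it assumes the key is supplied through the OPENAI_API_KEY environment variable, which is the OpenAI SDK convention and is not confirmed by this summary.

```python
import os
import sys


def check_prerequisites(version_info=sys.version_info, environ=os.environ) -> list[str]:
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    # The summary lists Python 3.11+ as a prerequisite.
    if tuple(version_info[:2]) < (3, 11):
        problems.append("Python 3.11 or newer is required")
    # Assumption: the API key is read from OPENAI_API_KEY.
    if not environ.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    return problems


for problem in check_prerequisites():
    print(f"missing prerequisite: {problem}")
```

Passing the version and environment as parameters keeps the check easy to test without changing the real environment.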

Highlighted Details

- Supports advanced graph structures including Hypergraphs and Spatio-Temporal Graphs, differentiating it from libraries focused solely on traditional knowledge graphs.
- Offers 8 distinct Auto-Types for knowledge representation, ranging from basic models to complex graph structures.
- Integrates over 10 extraction engines, featuring cutting-edge retrieval paradigms like GraphRAG, LightRAG, and Hyper-RAG.
- Provides a declarative approach with 80+ domain-specific YAML templates across Finance, Legal, Medical, TCM, Industry, and General domains, enabling zero-code extraction.
- Delivers both a user-friendly global CLI tool (he) and a Python API for seamless integration into existing workflows.
- Enables incremental knowledge updates, allowing the knowledge base to evolve dynamically with new data.
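
Two of the points above, hypergraphs and incremental updates, can be illustrated with a small data structure. The sketch below is an assumption-laden illustration, not the library's implementation: the Hypergraph class and its add and merge methods are hypothetical. A hyperedge here links any number of nodes at once, and merging a new batch reuses existing hyperedges instead of duplicating them.

```python
from dataclasses import dataclass, field


@dataclass
class Hypergraph:
    nodes: set[str] = field(default_factory=set)
    # Key: (relation label, set of participants). Value: ids of the
    # documents that support the hyperedge.
    hyperedges: dict[tuple[str, frozenset[str]], set[str]] = field(default_factory=dict)

    def add(self, relation: str, participants: list[str], source: str) -> None:
        """Record one hyperedge; participant order does not matter."""
        members = frozenset(participants)
        self.nodes |= members
        self.hyperedges.setdefault((relation, members), set()).add(source)

    def merge(self, other: "Hypergraph") -> None:
        """Fold another extraction result into this knowledge base in place."""
        self.nodes |= other.nodes
        for key, sources in other.hyperedges.items():
            self.hyperedges.setdefault(key, set()).update(sources)


# Knowledge extracted from a first document.
kb = Hypergraph()
kb.add("co_authored", ["Ada", "Grace", "Linus"], source="doc1")

# A later document repeats one fact and adds a new one.
batch = Hypergraph()
batch.add("co_authored", ["Linus", "Grace", "Ada"], source="doc2")
batch.add("acquired", ["AcmeCorp", "Widgets Ltd"], source="doc2")

kb.merge(batch)
print(len(kb.nodes), len(kb.hyperedges))
```

Keying hyperedges by a frozenset means the three-way co_authored relation is stored once even though the second document lists the participants in a different order; only its set of supporting sources grows.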

Maintenance & Community

The project is marked as active. Contributions via Issues and Pull Requests are welcome. The README does not mention community channels (e.g., Discord, Slack) or a public roadmap.

Licensing & Compatibility

Licensed under the Apache-2.0 license. This license is permissive and generally compatible with commercial use and linking within closed-source projects.

Limitations & Caveats

The default configuration requires an OpenAI API key, which may incur usage costs and creates a vendor dependency. Although the framework supports multiple extraction engines, the README does not present performance benchmarks or comparisons with non-OpenAI LLM providers.
