Hyper-Extract by yifanfeng97

LLM-driven framework for transforming text into structured knowledge

Created 3 months ago
524 stars

Top 59.9% on SourcePulse

Project Summary

Hyper-Extract uses LLMs to transform unstructured text into structured knowledge, so users can generate Knowledge Abstracts from documents. Output formats range from simple collections and Pydantic models to Knowledge Graphs, Hypergraphs, and Spatio-Temporal Graphs. The framework targets engineers, researchers, and power users who need to extract and understand information from diverse text sources, under the tagline "stop reading, start understanding".
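
To make the difference between an ordinary graph and a hypergraph concrete, the following minimal sketch models a hyperedge as a relation over any number of entities. The class and method names (Hyperedge, Hypergraph, add_edge, edges_of) are hypothetical and are not taken from Hyper-Extract's API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Hyperedge:
    """A relation connecting any number of entities (hypothetical type)."""
    label: str
    members: frozenset[str]


@dataclass
class Hypergraph:
    """Illustrative container; not Hyper-Extract's AutoHypergraph."""
    nodes: set[str] = field(default_factory=set)
    edges: list[Hyperedge] = field(default_factory=list)

    def add_edge(self, label: str, *members: str) -> Hyperedge:
        # Register every member as a node, then store the edge.
        edge = Hyperedge(label, frozenset(members))
        self.nodes.update(members)
        self.edges.append(edge)
        return edge

    def edges_of(self, node: str) -> list[Hyperedge]:
        # All hyperedges that include the given node, in insertion order.
        return [e for e in self.edges if node in e.members]


hg = Hypergraph()
hg.add_edge("co_authored", "Alice", "Bob", "Carol")  # one edge, three members
hg.add_edge("employs", "Acme", "Alice")              # an ordinary binary edge
```

A plain knowledge graph would need three pairwise edges to approximate the co_authored relation and would lose the fact that it is a single joint event. A hyperedge keeps it as one unit.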

How It Works

Hyper-Extract employs a three-layer architecture: Auto-Types define structured output formats (e.g., AutoGraph, AutoHypergraph, AutoSpatioTemporalGraph), Methods provide extraction algorithms (including RAG-based approaches like GraphRAG and Hyper-RAG), and Templates offer domain-specific configurations. This design allows for declarative, zero-code extraction via YAML templates, supporting over 80 presets across six domains. The framework facilitates incremental knowledge evolution, allowing continuous updates as new documents are processed.
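
The layering described above can be sketched in a few lines of plain Python. This is an illustration of the idea only: Triple, naive_method, METHODS, and run are hypothetical names, the template is a dict standing in for a parsed YAML file, and a string splitter stands in for the LLM call.

```python
from dataclasses import dataclass
from typing import Callable


# Layer 1: an "auto-type" is modelled here as a plain output schema.
@dataclass
class Triple:
    head: str
    relation: str
    tail: str


# Layer 2: a "method" is any callable that turns text into triples.
Method = Callable[[str], list[Triple]]


def naive_method(text: str) -> list[Triple]:
    """Stand-in for an LLM call: parses lines shaped like 'A | rel | B'."""
    triples = []
    for line in text.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == 3:
            triples.append(Triple(*parts))
    return triples


METHODS: dict[str, Method] = {"naive": naive_method}

# Layer 3: a "template" is declarative configuration (as if loaded from YAML).
template = {"name": "demo", "method": "naive"}


def run(template: dict, text: str) -> list[Triple]:
    # The template picks the method; the method produces the typed output.
    method = METHODS[template["method"]]
    return method(text)


result = run(template, "Aspirin | treats | headache\nnot a triple")
```

Because the template only names a method and never contains code, swapping extraction strategies is a configuration change, which is the sense in which such a design is "zero-code".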

Quick Start & Requirements

  • Primary Install/Run:
    • CLI Users: uv tool install hyperextract
    • Python Developers: uv pip install hyperextract
    • Local Development: Clone https://github.com/yifanfeng97/hyper-extract.git, cd hyper-extract, then uv sync.
  • Prerequisites: Python 3.11+ and an OpenAI API key. The default models are gpt-4o-mini (LLM) and text-embedding-3-small (embeddings).
  • Links:
    • Docs: https://yifanfeng97.github.io/Hyper-Extract/latest/
    • Examples: examples/en/ directory within the repository.
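
A small preflight check for the prerequisites listed above might look like the sketch below. It assumes the key is supplied through the OPENAI_API_KEY environment variable, which is the name OpenAI's own SDK reads by default; whether Hyper-Extract reads the same variable is an assumption, and missing_prerequisites is a hypothetical helper.

```python
import os
import sys


def missing_prerequisites(env=None, version=None) -> list[str]:
    """Return a list of human-readable problems; empty means ready to go."""
    env = os.environ if env is None else env
    version = sys.version_info[:2] if version is None else version
    problems = []
    if tuple(version) < (3, 11):
        problems.append("Python 3.11+ required")
    # Assumption: the API key is provided via OPENAI_API_KEY.
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print(f"warning: {problem}")
```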

Highlighted Details

  • Supports advanced graph structures including Hypergraphs and Spatio-Temporal Graphs, differentiating it from libraries focused solely on traditional knowledge graphs.
  • Offers 8 distinct Auto-Types for knowledge representation, ranging from basic models to complex graph structures.
  • Integrates over 10 extraction engines, including retrieval paradigms such as GraphRAG, LightRAG, and Hyper-RAG.
  • Provides a declarative approach with 80+ domain-specific YAML templates across Finance, Legal, Medical, TCM, Industry, and General domains, enabling zero-code extraction.
  • Ships both a global CLI tool (he) and a Python API for integration into existing workflows.
  • Enables incremental knowledge updates, allowing the knowledge base to evolve dynamically with new data.
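
The incremental-update idea in the last bullet can be illustrated with a simple merge step: edges extracted from a new document are added to the existing knowledge base only if they are not already present. This is a sketch under simplifying assumptions (exact match after lowercasing, no LLM-based entity resolution); merge_edges is a hypothetical helper, not a Hyper-Extract function.

```python
Edge = tuple[str, str, str]


def merge_edges(known: set[Edge], new_edges: list[Edge]) -> list[Edge]:
    """Add unseen edges to `known` in place; return the ones actually added."""
    added = []
    for head, rel, tail in new_edges:
        # Normalise so trivially different spellings count as the same edge.
        edge = (head.strip().lower(), rel.strip().lower(), tail.strip().lower())
        if edge not in known:
            known.add(edge)
            added.append(edge)
    return added


kb: set[Edge] = set()
# First document contributes one edge.
first = merge_edges(kb, [("Aspirin", "treats", "Headache")])
# Second document repeats that edge and adds a new one.
second = merge_edges(kb, [("aspirin", "treats", "headache"),
                          ("Aspirin", "is_a", "NSAID")])
```

Only the genuinely new edge from the second document is added, so the knowledge base grows without duplicating what earlier documents already contributed.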

Maintenance & Community

The project is marked as active. Contributions via Issues and Pull Requests are welcome. The provided README does not mention community channels (e.g., Discord, Slack) or a public roadmap.

Licensing & Compatibility

Licensed under the Apache-2.0 license. This license is permissive and generally compatible with commercial use and linking within closed-source projects.

Limitations & Caveats

The default configuration relies on OpenAI API keys, which may incur costs and introduce vendor-specific dependencies. While the framework supports multiple extraction engines, detailed performance benchmarks or comparisons against non-OpenAI LLM providers are not explicitly presented.

Health Check

  • Last Commit: 1 day ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 9
  • Issues (30d): 1
  • Star History: 541 stars in the last 30 days
