Document-Parser-Agent  by Micheliliuv87

AI agent for structured data extraction from documents

Created 11 months ago
261 stars

Top 97.5% on SourcePulse


Summary

This project provides an AI-powered agent that parses unstructured data from Excel files, specifically Google's product update logs. It converts the raw text into structured formats suitable for analysis by extracting key details: feature names, the associated action (added, removed, or updated), and the affected products. It suits users who need to organize and analyze product update histories systematically, with an LLM handling the extraction step.

How It Works

The core of the agent uses OpenAI's GPT-4o model, guided by meta-prompt engineering techniques and a defined prompt template. To stay within the LLM's context window, the input Excel sheet is first split into smaller yearly files, each grouped by month. A dedicated parser then iterates through the rows, extracts the relevant text, and calls the LLM to identify and categorize feature updates. The extracted data is saved as JSON, then cleaned and consolidated into structured Excel workbooks: first as monthly/yearly summaries, and then optionally as a feature-centric timeline.
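
The year/month split described above can be sketched as follows. This is a minimal illustration and not the project's prepare.py: the function name `group_rows_by_year_month` and the sample rows are hypothetical, and the rows are assumed to be already loaded from the workbook as dictionaries whose "Date" value is a `datetime.date`.

```python
from collections import defaultdict
from datetime import date


def group_rows_by_year_month(rows):
    """Group row dicts into {year: {month: [rows]}} keyed on each row's "Date"."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        row_date = row["Date"]  # assumed to already be a datetime.date
        grouped[row_date.year][row_date.month].append(row)
    # Return plain dicts so a lookup of a missing key raises instead of creating it.
    return {year: dict(months) for year, months in grouped.items()}


# Hypothetical rows standing in for the spreadsheet contents.
rows = [
    {"Date": date(2023, 1, 5), "Title": "Update A", "Features": "text A", "Editions": "All"},
    {"Date": date(2023, 1, 20), "Title": "Update B", "Features": "text B", "Editions": "All"},
    {"Date": date(2023, 3, 2), "Title": "Update C", "Features": "text C", "Editions": "All"},
    {"Date": date(2024, 1, 9), "Title": "Update D", "Features": "text D", "Editions": "All"},
]

chunks = group_rows_by_year_month(rows)
for year in sorted(chunks):
    for month in sorted(chunks[year]):
        # Each (year, month) bucket is small enough to send to the LLM separately.
        print(year, month, len(chunks[year][month]))
```

Each year's buckets could then be written to a separate file, so that no single LLM request has to carry the whole log.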

Quick Start & Requirements

  1. Input: An Excel file containing columns: Date, Title, Features, Editions.
  2. Setup: Requires a Python environment and an OpenAI API key.
  3. Execution:
    • Run prepare.py to segment the input document.
    • Run main.py to initiate the full parsing and structuring pipeline.
    • The script Convert_to_FeatureSpecific.py can be used for final output transformation.
  • Links: No external documentation or demo links are provided.
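
Because the pipeline expects the four input columns listed in step 1, a quick header check before running it may save a failed run. The sketch below is an assumption about how such a check could look, not code from the repository; `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names, and the header is assumed to have been read from the first row of the sheet.

```python
REQUIRED_COLUMNS = ("Date", "Title", "Features", "Editions")


def missing_columns(header):
    """Return the required column names that are absent from the header row."""
    # Strip stray whitespace, which is common in hand-edited spreadsheets.
    present = {str(name).strip() for name in header}
    return [column for column in REQUIRED_COLUMNS if column not in present]


# Hypothetical header rows.
print(missing_columns(["Date", "Title", "Features", "Editions"]))
print(missing_columns(["Date", "Title ", "Notes"]))
```

An empty list means the sheet has every expected column; anything else names the columns to add or rename first.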

Highlighted Details

  • Uses OpenAI's GPT-4o model for natural language understanding.
  • Employs meta-prompt engineering to enhance the reliability of information extraction.
  • Addresses potential LLM context limitations by chunking input data into yearly and monthly files.
  • Generates structured output in two formats: detailed monthly/yearly breakdowns and a feature-focused timeline.
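
The second output format, a feature-focused timeline, amounts to regrouping the extracted records by feature and ordering them by date. The following is an illustrative sketch only and does not reproduce Convert_to_FeatureSpecific.py: the record keys (`date`, `feature`, `action`, `product`), the `YYYY-MM` date strings, and the function name are all assumptions.

```python
from collections import defaultdict


def build_feature_timeline(records):
    """Map each feature to its chronologically ordered (date, action, product) events."""
    timeline = defaultdict(list)
    # "YYYY-MM" strings sort chronologically, so a plain string sort is enough here.
    for record in sorted(records, key=lambda r: r["date"]):
        event = (record["date"], record["action"], record["product"])
        timeline[record["feature"]].append(event)
    return dict(timeline)


# Hypothetical extracted records.
records = [
    {"date": "2023-03", "feature": "Feature X", "action": "updated", "product": "Product A"},
    {"date": "2023-01", "feature": "Feature X", "action": "added", "product": "Product A"},
    {"date": "2023-02", "feature": "Feature Y", "action": "removed", "product": "Product B"},
]

timeline = build_feature_timeline(records)
for feature, events in timeline.items():
    print(feature, events)
```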

Maintenance & Community

The README gives no information about maintainers, community channels (e.g., Discord, Slack), a project roadmap, or sponsorships.

Licensing & Compatibility

The README does not state the project's license or give any compatibility notes for commercial use or integration with closed-source systems.

Limitations & Caveats

The agent works only if the input Excel file strictly follows the specified column structure (Date, Title, Features, Editions). The multi-stage pipeline, which spans several Python scripts and an external LLM API, may complicate setup and debugging. Chunking the input keeps each request within the model's context window and is intended to reduce hallucination, but inherent LLM limitations may still result in occasional inaccuracies.
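
One way to catch some of the occasional inaccuracies is to validate the LLM's JSON output before it is consolidated. The sketch below is an assumption, not part of the project: it presumes the model returns a JSON array of objects with hypothetical `feature` and `action` keys, and it only accepts the three actions named in the summary (added, removed, updated).

```python
import json

ALLOWED_ACTIONS = {"added", "removed", "updated"}


def parse_llm_records(raw_json):
    """Split an LLM response into (valid, rejected) records."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError:
        # The whole response is unusable; keep it so it can be inspected or retried.
        return [], [raw_json]
    if not isinstance(data, list):
        return [], [data]

    valid, rejected = [], []
    for item in data:
        if not isinstance(item, dict):
            rejected.append(item)
            continue
        # Normalize case and whitespace so "Added " is accepted as "added".
        action = str(item.get("action", "")).strip().lower()
        if item.get("feature") and action in ALLOWED_ACTIONS:
            valid.append({**item, "action": action})
        else:
            rejected.append(item)
    return valid, rejected


# Hypothetical model output: one good record, one unknown action, one stray string.
raw = (
    '[{"feature": "Feature X", "action": "Added", "product": "Product A"},'
    ' {"feature": "Feature Y", "action": "renamed"},'
    ' "stray text"]'
)
valid, rejected = parse_llm_records(raw)
print(len(valid), len(rejected))
```

Rejected items can be logged for manual review or sent back to the model, rather than silently entering the final workbook.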

Health Check

  • Last Commit: 9 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 78 stars in the last 30 days
