llm-mistral-invoice-cpu by katanaml

LLM data extraction on CPU

Created 2 years ago

270 stars

Top 95.4% on SourcePulse

Project Summary

This project extracts data from invoices using a Mistral large language model (LLM) running on a local CPU. It targets users who need to process invoice documents without cloud-based services or powerful GPUs, and offers a self-contained setup for automated data extraction.

How It Works

The system ingests text-based PDF invoices by computing vector embeddings of their text and storing them in a FAISS index. At query time, a natural language prompt is used to retrieve the most relevant content from the index, and the Mistral LLM extracts the requested information, such as an invoice number, from that content. This lets the pipeline pull structured fields out of unstructured invoice documents.
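The retrieve-then-extract flow described above can be illustrated with a minimal, dependency-free sketch. It is not the project's code: the bag-of-words `embed` function stands in for a real embedding model, the sorted list scan stands in for a FAISS nearest-neighbour search, and the sample chunks, function names, and prompt wording are all hypothetical. The final prompt is what would be handed to the LLM.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: word counts (assumption: stands in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(chunks):
    """Ingest step: store each text chunk next to its vector (stand-in for a FAISS index)."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index, query, k=2):
    """Query step: return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(context_chunks, question):
    """Assemble the text that would be sent to the LLM (hypothetical prompt wording)."""
    context = "\n".join(context_chunks)
    return f"Use the context to answer.\nContext:\n{context}\nQuestion: {question}\nAnswer:"


# Hypothetical invoice text, already split into chunks.
chunks = [
    "Invoice number: INV-1001",
    "Bill to: Example Corp, 1 Main Street",
    "Total due: 250.00 EUR",
]
index = build_index(chunks)
top = retrieve(index, "retrieve invoice number value", k=1)
prompt = build_prompt(top, "retrieve invoice number value")
print(prompt)
```

Only the retrieved chunk reaches the model, which keeps the prompt short; that matters on a CPU, where every extra token adds latency.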

Quick Start & Requirements

Install requirements: pip install -r requirements.txt
Download Mistral model (link provided in models/model_download.txt).
Place text PDF files in the data folder.
Ingest data: python ingest.py
Process data: python main.py "retrieve invoice number value"
Requires Python and a compatible Mistral model.
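The scripted steps above could be chained from a small Python driver, sketched below. The helper names `quick_start_commands` and `run_quick_start` are hypothetical, and `python -m pip` is used here as an equivalent of the `pip` command shown in the list. Downloading the model and placing PDFs in the data folder remain manual steps that this sketch does not cover.

```python
import subprocess
import sys


def quick_start_commands(query, python=sys.executable):
    """Return the documented commands as argument lists, in the order they should run."""
    return [
        [python, "-m", "pip", "install", "-r", "requirements.txt"],  # install requirements
        [python, "ingest.py"],                                       # ingest the PDFs
        [python, "main.py", query],                                  # run the extraction query
    ]


def run_quick_start(query, runner=subprocess.run):
    """Run each command in turn; check=True stops at the first failing step."""
    for cmd in quick_start_commands(query):
        runner(cmd, check=True)


# Usage, from the repository root after the manual steps are done:
# run_quick_start("retrieve invoice number value")
```

Passing the query as a single list element avoids shell quoting problems with the spaces in the prompt.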

Highlighted Details

Enables LLM-based invoice data extraction on CPU.
Utilizes FAISS for efficient vector embedding storage and retrieval.
Supports processing of text-based PDF invoices.

Maintenance & Community

No specific information on contributors, sponsorships, or community channels is provided in the README.

Licensing & Compatibility

The license is not specified in the README. Compatibility for commercial or closed-source use is not detailed.

Limitations & Caveats

Because inference runs on the CPU, extraction speed depends heavily on the available hardware. The project handles only text-based PDFs; image-based (scanned) invoices would require an additional OCR step. The README does not specify the exact Mistral model version or its licensing.
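One way to spot invoices that fall outside the text-only scope is to check how much text each page yields before ingesting it. The sketch below is not part of the project; it assumes the per-page text has already been extracted with a PDF library of your choice, and both the function name `pages_needing_ocr` and the 20-character threshold are arbitrary, hypothetical choices.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return indexes of pages whose extracted text is too short to be a real text layer.

    page_texts: one string per page, as produced by a PDF text extractor (assumption).
    min_chars: arbitrary cutoff; scanned pages typically yield little or no text.
    """
    return [i for i, text in enumerate(page_texts) if len(text.strip()) < min_chars]


# Hypothetical two-page document: page 0 has a text layer, page 1 is a scan.
sample_pages = ["Invoice number: INV-1001, total due 250.00 EUR", "  \n "]
flagged = pages_needing_ocr(sample_pages)
print(flagged)
```

Pages flagged this way would need to go through OCR before their text can be embedded and queried.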
