spaCy plugin for structured PDF/document processing
This library processes PDFs and Word documents and converts them into structured spaCy Doc objects. It is designed for NLP practitioners and researchers who need to apply text analysis, entity recognition, or chunking for RAG pipelines to document collections. The primary benefit is turning unstructured document content into machine-readable data with rich layout metadata.
How It Works
The plugin uses the Docling library to parse various document formats. It extracts text, identifies layout elements such as sections and tables, and converts tabular data into pandas DataFrames. The layout elements are then mapped to spaCy Span objects, accessible via doc.spans["layout"], with custom attributes such as span._.layout for bounding box information and span._.heading for the nearest heading. This integrates document structure directly into the NLP workflow, so downstream components can use headings, pages, and tables alongside the text.
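The flow described above can be sketched roughly as follows. This is a minimal illustration, not the project's reference usage: it assumes that the package exposes a spaCyLayout class that wraps an nlp object and is called with a file path, and that span._.layout carries a page_no field. The helper names load_document and layout_rows are hypothetical. The third-party imports are deferred so that the row-building helper can be used on its own.

```python
def load_document(path, lang="en"):
    """Parse a PDF or Word file into a spaCy Doc.

    Assumption: spacy-layout provides `spaCyLayout`, constructed from an
    `nlp` object and called with a path. Imports are deferred until use.
    """
    import spacy
    from spacy_layout import spaCyLayout

    nlp = spacy.blank(lang)
    layout = spaCyLayout(nlp)
    return layout(path)


def layout_rows(doc):
    """Return one (label, heading_text, page_no, text) tuple per layout span."""
    rows = []
    for span in doc.spans["layout"]:
        heading = span._.heading  # nearest heading span, or None
        box = span._.layout       # bounding box info (assumed to have page_no)
        rows.append(
            (
                span.label_,
                heading.text if heading is not None else None,
                box.page_no if box is not None else None,
                span.text,
            )
        )
    return rows


# Hypothetical usage, with a file of your own:
#   doc = load_document("./report.pdf")
#   for label, heading, page_no, text in layout_rows(doc):
#       print(label, heading, page_no, text[:60])
```

Grouping the rows by heading is one simple way to build section-level chunks for a RAG pipeline.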
Quick Start & Requirements
pip install spacy-layout
python -m spacy download en_core_web_trf
Highlighted Details
Tables are accessible via doc._.tables.
Doc objects can be serialized to spaCy's binary format for efficient reuse.
Table rendering in Doc.text can be customized via a display_table callback.
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
spaCyLayout must be initialized when deserializing Doc objects from binary files, a planned area for improvement.
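The deserialization caveat can be sketched as a save/load round trip. This is a hedged illustration: it uses spaCy's DocBin with store_user_data=True, and it assumes that constructing spaCyLayout is what registers the layout extension attributes, so it is constructed before any Doc objects are read back. The helper names save_docs and load_docs are hypothetical.

```python
def save_docs(docs, path):
    """Write processed Doc objects to spaCy's binary format."""
    from spacy.tokens import DocBin

    # store_user_data=True keeps custom extension values alongside the tokens
    doc_bin = DocBin(docs=docs, store_user_data=True)
    doc_bin.to_disk(path)


def load_docs(path, lang="en"):
    """Read Doc objects back from a binary file."""
    import spacy
    from spacy.tokens import DocBin
    from spacy_layout import spaCyLayout

    nlp = spacy.blank(lang)
    # Assumption: initializing spaCyLayout registers the custom attributes,
    # so this happens before the docs are deserialized.
    spaCyLayout(nlp)
    doc_bin = DocBin(store_user_data=True).from_disk(path)
    return list(doc_bin.get_docs(nlp.vocab))


# Hypothetical usage:
#   save_docs(processed_docs, "./docs.spacy")
#   docs = load_docs("./docs.spacy")
```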