explosion
spaCy plugin for structured PDF/document processing
Top 44.3% on SourcePulse
This library processes PDFs and Word documents, converting them into structured data compatible with spaCy's Doc objects. It's designed for NLP practitioners and researchers needing to apply advanced text analysis, entity recognition, or chunking for RAG pipelines to document collections. The primary benefit is transforming unstructured document content into machine-readable formats with rich layout metadata.
How It Works
The plugin leverages the Docling library to parse various document formats. It extracts text, identifies layout elements like sections and tables, and converts tabular data into pandas DataFrames. These are then mapped to spaCy's Span objects, accessible via doc.spans["layout"], with custom attributes like span._.layout for bounding box information and span._.heading for the nearest heading. This approach integrates document structure directly into the NLP workflow, enabling richer analysis.
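As a rough illustration of the span-based access described above, the following sketch walks doc.spans["layout"] and pairs each span with its nearest heading. The helper name summarize_layout and the path ./document.pdf are hypothetical, chosen only for this example; the spaCyLayout usage in the main block assumes the spacy-layout package is installed and is a minimal sketch, not a complete pipeline.

```python
from collections import Counter


def summarize_layout(doc):
    """Count layout spans by label and pair each span with its nearest heading.

    Hypothetical helper: works on any object exposing doc.spans["layout"],
    where each span has .label_, .text and the custom attribute ._.heading.
    """
    spans = doc.spans["layout"]
    counts = Counter(span.label_ for span in spans)
    rows = []
    for span in spans:
        heading = span._.heading  # nearest heading span, or None if unavailable
        heading_text = heading.text if heading is not None else None
        rows.append((span.label_, heading_text, span.text))
    return counts, rows


if __name__ == "__main__":
    # Assumes `pip install spacy-layout`; the file path is a placeholder.
    import spacy
    from spacy_layout import spaCyLayout

    nlp = spacy.blank("en")
    layout = spaCyLayout(nlp)
    doc = layout("./document.pdf")

    counts, rows = summarize_layout(doc)
    print(counts)
    for label, heading_text, text in rows[:5]:
        print(label, "|", heading_text, "|", text[:60])
```

Keeping the helper independent of the parsing step means it can be reused on Doc objects loaded from elsewhere, as long as the layout span group and the heading attribute are present.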
Quick Start & Requirements
pip install spacy-layout
python -m spacy download en_core_web_trf
Highlighted Details
Tables are accessible via doc._.tables.
Doc objects can be serialized to spaCy's binary format for efficient reuse.
Table rendering in Doc.text can be customized via a display_table callback.
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
spaCyLayout must be initialized when deserializing Doc objects from binary files, a planned area for improvement.
8 months ago
Inactive