spacy-layout by explosion

spaCy plugin for structured PDF/document processing

Created 1 year ago

838 stars

Top 42.5% on SourcePulse

Project Summary

This library processes PDFs and Word documents and outputs them as structured spaCy Doc objects. It is aimed at NLP practitioners and researchers who want to apply text analysis, entity recognition, or chunking for RAG pipelines to document collections. Its main benefit is turning unstructured document content into machine-readable text with rich layout metadata.

How It Works

The plugin uses the Docling library to parse various document formats. It extracts text, identifies layout elements such as sections and tables, and converts tabular data into pandas DataFrames. Each layout element is mapped to a spaCy Span, accessible via doc.spans["layout"], with custom attributes such as span._.layout for bounding box information and span._.heading for the nearest heading. This integrates document structure directly into the NLP workflow, so downstream components can use layout information alongside the text.
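The flow described above can be sketched roughly as follows. This is a minimal illustration, not the project's reference usage: the helper names load_layout_doc and iter_layout_records are hypothetical, and the bounding box field names (x, y, width, height, page_no) are assumed from the project's README rather than stated in this summary.

```python
from typing import Any, Dict, Iterator


def load_layout_doc(path: str) -> Any:
    """Parse a PDF or Word file and return a spaCy Doc with layout spans."""
    # Imported lazily so the helper below also works without these installed.
    import spacy
    from spacy_layout import spaCyLayout

    nlp = spacy.blank("en")
    layout = spaCyLayout(nlp)
    return layout(path)


def iter_layout_records(doc: Any) -> Iterator[Dict[str, Any]]:
    """Yield one plain dict per layout span (hypothetical helper)."""
    for span in doc.spans["layout"]:
        box = span._.layout  # bounding box info; field names are assumed
        heading = span._.heading  # nearest heading span, or None
        yield {
            "label": span.label_,
            "text": span.text,
            "heading": heading.text if heading is not None else None,
            "page": box.page_no if box is not None else None,
            "bbox": (box.x, box.y, box.width, box.height)
            if box is not None
            else None,
        }
```

Calling list(iter_layout_records(load_layout_doc("report.pdf"))) on a local file (the path is a placeholder) would give records that can be grouped by heading, for example when building chunks for a RAG index.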

Quick Start & Requirements

Install via pip: pip install spacy-layout
Requires Python 3.10 or above.
For transformer-based pipelines, download models: python -m spacy download en_core_web_trf
Official docs: https://spacy-dev.github.io/spacy-layout/

Highlighted Details

Extracts layout spans (sections, headers, text) with bounding box coordinates.
Converts tables into pandas DataFrames: table spans are listed in doc._.tables, and each table span exposes its DataFrame via span._.data.
Supports serialization of processed Doc objects to spaCy's binary format for efficient reuse.
Allows customization of table representation in the Doc.text via a display_table callback.
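As a rough sketch of the table features listed above, the snippet below passes a display_table callback and then collects the extracted DataFrames. The function names summarize_table, load_with_table_summaries and table_frames are hypothetical, and reading the DataFrame from span._.data is an assumption based on the project's README.

```python
from typing import TYPE_CHECKING, Any, List

if TYPE_CHECKING:
    import pandas as pd


def summarize_table(df: "pd.DataFrame") -> str:
    """Return the placeholder text that stands in for a table in Doc.text."""
    columns = ", ".join(str(name) for name in df.columns)
    return f"Table with {len(df)} rows and columns: {columns}"


def load_with_table_summaries(path: str) -> Any:
    """Parse a document, rendering each table via summarize_table."""
    # Imported lazily so the pure helpers can be used without these installed.
    import spacy
    from spacy_layout import spaCyLayout

    nlp = spacy.blank("en")
    layout = spaCyLayout(nlp, display_table=summarize_table)
    return layout(path)


def table_frames(doc: Any) -> List[Any]:
    """Collect the DataFrame attached to each table span, skipping empty ones."""
    # Assumption: the DataFrame is stored on the span as span._.data.
    return [span._.data for span in doc._.tables if span._.data is not None]
```

A short, descriptive placeholder like this keeps Doc.text readable for downstream components while the full tabular data stays available on the span.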

Maintenance & Community

Developed by Explosion AI, the creators of spaCy.
Active development indicated by recent updates and a clear roadmap for improvements.
Community support likely through existing spaCy channels.

Licensing & Compatibility

MIT License.
Permissive, allowing for commercial use and integration into closed-source projects.

Limitations & Caveats

Custom extension attributes are registered when spaCyLayout is initialized, so it must be initialized before Doc objects are deserialized from binary files. This is a planned area for improvement.
Accuracy of heading detection depends on the document's structural consistency.
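The serialization caveat can be illustrated with a save and load round trip. This is a sketch under assumptions: save_docs and load_docs are hypothetical helper names, and the use of layout.pipe together with DocBin(store_user_data=True) follows the project's README as best recalled, so check it against the current version.

```python
from typing import Any, Iterable, List


def save_docs(paths: Iterable[str], out_path: str) -> None:
    """Process several documents and store them in spaCy's binary format."""
    import spacy
    from spacy.tokens import DocBin
    from spacy_layout import spaCyLayout

    nlp = spacy.blank("en")
    layout = spaCyLayout(nlp)
    docs = layout.pipe(paths)
    # store_user_data=True keeps the custom extension values with each Doc.
    doc_bin = DocBin(docs=docs, store_user_data=True)
    doc_bin.to_disk(out_path)


def load_docs(in_path: str) -> List[Any]:
    """Load Docs back, registering the layout extensions first."""
    import spacy
    from spacy.tokens import DocBin
    from spacy_layout import spaCyLayout

    nlp = spacy.blank("en")
    # Initializing spaCyLayout registers the extension attributes that the
    # stored Docs rely on, so it has to happen before deserializing.
    spaCyLayout(nlp)
    doc_bin = DocBin(store_user_data=True).from_disk(in_path)
    return list(doc_bin.get_docs(nlp.vocab))
```

Skipping the spaCyLayout(nlp) call in load_docs is the failure mode the caveat describes: the layout attributes would not be registered in a fresh process.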

Health Check

Last Commit

10 months ago

Responsiveness

Inactive

Star History

14 stars in the last 30 days