spacy-layout by explosion

spaCy plugin for structured PDF/document processing

created 8 months ago
693 stars

Top 50.0% on sourcepulse

Project Summary

This library processes PDFs and Word documents and converts them into structured spaCy Doc objects. It is aimed at NLP practitioners and researchers who want to run text analysis, entity recognition, or chunking for RAG pipelines over document collections. Its main benefit is turning unstructured document content into machine-readable text that carries layout metadata such as sections, headings, tables, and bounding boxes.

How It Works

The plugin uses the Docling library to parse the supported document formats. It extracts text, identifies layout elements such as sections and tables, and converts tabular data into pandas DataFrames. Each layout element is mapped to a spaCy Span, accessible via doc.spans["layout"], with custom attributes such as span._.layout for bounding box information and span._.heading for the nearest heading. This keeps the document structure available alongside the text throughout the NLP workflow.
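
The flow described above can be sketched roughly as follows. This is a minimal illustration, not the project's reference usage: the helper names load_document and iter_layout_rows and any file path passed in are hypothetical, and the bounding box field names (x, y, width, height, page_no) are assumed from the project's README rather than stated in this summary.

```python
from typing import Any, Dict, Iterator


def load_document(path: str) -> Any:
    """Parse a PDF or Word file into a spaCy Doc that carries layout spans."""
    # Lazy imports: these require `pip install spacy-layout`.
    import spacy
    from spacy_layout import spaCyLayout

    nlp = spacy.blank("en")    # a blank pipeline is enough for layout parsing
    layout = spaCyLayout(nlp)  # delegates the document parsing to Docling
    return layout(path)


def iter_layout_rows(doc: Any) -> Iterator[Dict[str, Any]]:
    """Yield one record per layout span: label, nearest heading, page, box, text."""
    for span in doc.spans["layout"]:
        heading = span._.heading  # nearest heading span, or None
        box = span._.layout       # assumed fields: x, y, width, height, page_no
        yield {
            "label": span.label_,
            "heading": heading.text if heading is not None else None,
            "page": box.page_no if box is not None else None,
            "bbox": (box.x, box.y, box.width, box.height) if box is not None else None,
            "text": span.text,
        }
```

With the package installed, list(iter_layout_rows(load_document("report.pdf"))) would give one dictionary per layout span, where "report.pdf" stands in for any local file. Grouping those rows by their "heading" value is one simple way to build section-level chunks for a RAG pipeline.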

Quick Start & Requirements

  • Install via pip: pip install spacy-layout
  • Requires Python 3.10 or above.
  • For transformer-based pipelines, download models: python -m spacy download en_core_web_trf
  • Official docs: https://spacy-dev.github.io/spacy-layout/

Highlighted Details

  • Extracts layout spans (sections, headers, text) with bounding box coordinates.
  • Converts tables into pandas DataFrames; the table spans are listed in doc._.tables, and each span exposes its DataFrame as span._.data.
  • Supports serialization of processed Doc objects to spaCy's binary format for efficient reuse.
  • Allows customizing how tables are represented in doc.text via a display_table callback.
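
The table handling and the display_table callback listed above might look like the sketch below. It assumes that spaCyLayout accepts display_table as a keyword argument, that the callback receives a pandas DataFrame and returns a string, and that each table span exposes its DataFrame as span._.data. The helper names build_layout and table_frames are hypothetical.

```python
from typing import TYPE_CHECKING, Any, List

if TYPE_CHECKING:  # used for type hints only; pandas is not imported at runtime
    import pandas as pd


def display_table(df: "pd.DataFrame") -> str:
    """Return the placeholder text that represents a table inside doc.text."""
    columns = ", ".join(str(name) for name in df.columns)
    return f"Table with {len(df)} rows and columns: {columns}"


def build_layout() -> Any:
    """Create a spaCyLayout instance that uses the callback above."""
    # Lazy imports: these require `pip install spacy-layout`.
    import spacy
    from spacy_layout import spaCyLayout

    nlp = spacy.blank("en")
    return spaCyLayout(nlp, display_table=display_table)


def table_frames(doc: Any) -> List[Any]:
    """Collect the DataFrame attached to each table span."""
    # Assumption: doc._.tables holds the table spans, span._.data the DataFrame.
    return [table._.data for table in doc._.tables]
```

A short placeholder like this keeps large tables from flooding doc.text, while the full data stays available as DataFrames for separate processing.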

Maintenance & Community

  • Developed by Explosion AI, the creators of spaCy.
  • Active development indicated by recent updates and a clear roadmap for improvements.
  • Community support likely through existing spaCy channels.

Licensing & Compatibility

  • MIT License.
  • Permissive, allowing for commercial use and integration into closed-source projects.

Limitations & Caveats

  • When Doc objects are deserialized from binary files, spaCyLayout must be initialized again first so that its custom extension attributes are registered; this is listed as a planned area for improvement.
  • Accuracy of heading detection depends on the document's structural consistency.
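
The deserialization caveat above can be illustrated with a small round trip through spaCy's DocBin. This is a sketch under assumptions: the helper names save_docs and load_docs and the file path are hypothetical, and it relies on the behaviour described above, namely that constructing spaCyLayout registers the custom extension attributes.

```python
from typing import Any, Iterable, List


def save_docs(docs: Iterable[Any], path: str) -> None:
    """Write processed Doc objects to spaCy's binary format."""
    from spacy.tokens import DocBin

    # store_user_data=True keeps the data behind custom extension attributes.
    doc_bin = DocBin(docs=list(docs), store_user_data=True)
    doc_bin.to_disk(path)


def load_docs(path: str, nlp: Any) -> List[Any]:
    """Read Doc objects back, registering the layout extensions first."""
    from spacy.tokens import DocBin
    from spacy_layout import spaCyLayout

    # Constructing spaCyLayout registers the custom attributes as a side effect.
    # The instance itself is not needed afterwards.
    spaCyLayout(nlp)
    doc_bin = DocBin(store_user_data=True).from_disk(path)
    return list(doc_bin.get_docs(nlp.vocab))
```

In a fresh process, load_docs("docs.spacy", spacy.blank("en")) would then return Doc objects on which attributes such as doc._.tables can be read again.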

Health Check

  • Last commit: 4 months ago
  • Responsiveness: 1 week
  • Pull requests (30d): 1
  • Issues (30d): 2
  • Star history: 129 stars in the last 90 days
