sycamore by aryn-ai

LLM-powered platform for unstructured data search and analytics

created 2 years ago
548 stars

Top 59.1% on sourcepulse

Project Summary

Sycamore is an AI-powered platform for processing, analyzing, and enriching unstructured documents, targeting engineers and researchers building ETL pipelines, RAG systems, and LLM applications. It offers enhanced data chunking and recall for improved AI model performance on diverse document types.

How It Works

Sycamore utilizes Aryn DocParse, a GPU-powered API leveraging a DETR AI model trained on enterprise documents, for advanced document segmentation, OCR, and table extraction. This approach aims for superior data chunking accuracy and recall in hybrid search and RAG compared to other systems. The platform is built around a DocSet abstraction, enabling scalable, functional data transformations and reliable loading into various vector databases.
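The idea of a DocSet as a chain of functional transformations over documents can be illustrated with a small, self-contained sketch. This is not Sycamore's API: ToyDocSet, Document, and split_paragraphs are hypothetical names invented for this example, and the code runs in a single process rather than on Ray. It only shows the general pattern of lazy, composable map / filter / flat_map steps that run when results are requested.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Document:
    """Hypothetical stand-in for a parsed document or chunk."""
    text: str
    properties: dict = field(default_factory=dict)


class ToyDocSet:
    """Illustrative lazy collection of documents (NOT the Sycamore DocSet class)."""

    def __init__(self, source: Callable[[], Iterable[Document]]):
        # The source is a zero-argument callable, so nothing is computed
        # until take_all() is called.
        self._source = source

    def map(self, fn: Callable[[Document], Document]) -> "ToyDocSet":
        return ToyDocSet(lambda: (fn(d) for d in self._source()))

    def filter(self, pred: Callable[[Document], bool]) -> "ToyDocSet":
        return ToyDocSet(lambda: (d for d in self._source() if pred(d)))

    def flat_map(self, fn: Callable[[Document], Iterable[Document]]) -> "ToyDocSet":
        return ToyDocSet(lambda: (out for d in self._source() for out in fn(d)))

    def take_all(self) -> List[Document]:
        # Executes the whole chain and materializes the results.
        return list(self._source())


def split_paragraphs(doc: Document) -> List[Document]:
    """Naive chunker (an assumption for illustration): one chunk per paragraph."""
    return [
        Document(p.strip(), {**doc.properties, "chunk": i})
        for i, p in enumerate(doc.text.split("\n\n"))
        if p.strip()
    ]


docs = [
    Document("Intro text.\n\nTable: revenue by year.", {"path": "a.pdf"}),
    Document("", {"path": "empty.pdf"}),
]

pipeline = (
    ToyDocSet(lambda: docs)
    .filter(lambda d: d.text.strip() != "")   # drop empty documents
    .flat_map(split_paragraphs)               # one document -> many chunks
    .map(lambda d: Document(d.text, {**d.properties, "n_chars": len(d.text)}))
)

chunks = pipeline.take_all()
for c in chunks:
    print(c.properties["path"], c.properties["chunk"], c.text)
```

Because each step returns a new ToyDocSet rather than mutating the old one, intermediate pipelines can be reused or extended freely. In a real system the same chain would end in a write step to a vector database instead of take_all().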

Quick Start & Requirements

Highlighted Details

  • Integrates Aryn DocParse with a vision AI model for semantic document structure preservation.
  • DocSet abstraction for scalable, functional document manipulation.
  • Supports high-quality table extraction, OCR, visual summarization, and LLM-powered UDFs.
  • Includes automatic data crawlers (S3, HTTP) and an OpenSearch RAG engine.
  • Scalable backend powered by Ray.

Maintenance & Community

Licensing & Compatibility

  • PyPI package sycamore-ai is released under the Apache 2.0 license.

Limitations & Caveats

  • Primarily designed for Linux and macOS; Windows support is not explicitly mentioned.
  • Relies on Aryn DocParse for advanced document parsing, which has a cloud API option and a local option.

Health Check

  • Last commit: 19 hours ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 33
  • Issues (30d): 0
  • Star History: 36 stars in the last 90 days
