cocoindex by cocoindex-io

Real-time data transformation framework for AI indexing

created 5 months ago
2,323 stars

Top 20.0% on sourcepulse

Project Summary

CocoIndex is an open-source framework designed for real-time data transformation and indexing, particularly for AI applications. It enables users to define data processing pipelines that automatically maintain updated indexes based on source data changes, minimizing computational overhead. The target audience includes AI engineers and data scientists who need efficient and fresh data indexing for tasks like semantic search or knowledge graph construction.

How It Works

CocoIndex takes a declarative approach to data transformation and indexing workflows. Users specify data sources, transformation steps (e.g., text splitting, embedding generation), and target indexes. The framework then manages execution and incremental updates, keeping each index synchronized with its source data. It does this by detecting changes in the source and re-processing only what those changes affect, so an update costs as little computation as possible.
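The incremental-update idea can be illustrated with a small, library-free sketch. This is not CocoIndex's API or implementation: `IncrementalIndex`, `fingerprint`, and `split_paragraphs` are hypothetical names, and content hashing is simply one common way to detect changes, assumed here for illustration.

```python
import hashlib
from typing import Callable, Dict, List


def fingerprint(content: str) -> str:
    """Stable content hash used to detect changes between runs."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


class IncrementalIndex:
    """Toy index that re-runs the transformation only for changed sources."""

    def __init__(self, transform: Callable[[str], List[str]]) -> None:
        self.transform = transform
        self.fingerprints: Dict[str, str] = {}  # source key -> last seen hash
        self.rows: Dict[str, List[str]] = {}    # source key -> derived rows

    def update(self, sources: Dict[str, str]) -> Dict[str, List[str]]:
        """Sync the index with `sources` and report what was done."""
        report: Dict[str, List[str]] = {"processed": [], "skipped": [], "deleted": []}
        # Drop derived rows whose source no longer exists.
        for key in sorted(set(self.fingerprints) - set(sources)):
            del self.fingerprints[key]
            del self.rows[key]
            report["deleted"].append(key)
        for key in sorted(sources):
            digest = fingerprint(sources[key])
            if self.fingerprints.get(key) == digest:
                report["skipped"].append(key)  # unchanged: keep existing rows
                continue
            self.rows[key] = self.transform(sources[key])
            self.fingerprints[key] = digest
            report["processed"].append(key)
        return report


def split_paragraphs(text: str) -> List[str]:
    """Example transformation: split a document into non-empty paragraphs."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


index = IncrementalIndex(split_paragraphs)
first = index.update({"a.md": "one\n\ntwo", "b.md": "three"})
second = index.update({"a.md": "one\n\ntwo", "b.md": "three, edited"})
print(first["processed"], second["processed"], second["skipped"])
```

On the second call only `b.md` is transformed again; `a.md` is skipped because its fingerprint is unchanged. A real framework would track changes at a finer grain and persist this state, which the sketch omits.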

Quick Start & Requirements

  • Install: pip install -U cocoindex
  • Prerequisites: PostgreSQL with the pgvector extension, or Docker Compose for setting up a PostgreSQL instance.
  • Setup: The documentation and a quick start video tutorial are available. A Docker Compose configuration for PostgreSQL is provided.
  • Documentation: https://cocoindex.io/docs/getting_started/quickstart
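Before running a pipeline, it can help to confirm that the PostgreSQL instance is reachable. The following is a minimal preflight sketch using only the Python standard library. The helper `port_is_open` is hypothetical, and the host and port (`localhost`, PostgreSQL's default 5432) are assumptions about a local setup. It checks TCP reachability only, not whether the pgvector extension is installed.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    # 5432 is PostgreSQL's default port; adjust if your instance uses another.
    print("PostgreSQL reachable:", port_is_open("localhost", 5432))
```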

Highlighted Details

  • Supports custom transformation logic and incremental updates for data indexing.
  • Offers pre-built examples for text embedding, code embedding, PDF processing, and more.
  • Integrates with vector databases like PostgreSQL (with pgvector) and Qdrant.
  • Enables extraction of structured information from documents using LLMs.
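To make the text-embedding use case concrete, here is a toy sketch of the two steps such a pipeline usually involves: splitting text into overlapping chunks, then ranking chunks by similarity to a query. None of this is CocoIndex code. `chunk_text`, `embed`, `cosine`, and `search` are hypothetical helpers, and word counts stand in for a real embedding model.

```python
import math
from collections import Counter
from typing import List, Tuple


def chunk_text(text: str, size: int, overlap: int) -> List[str]:
    """Split text into windows of `size` characters that overlap by `overlap`."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    if not text:
        return []
    step = size - overlap
    # Stop once the remaining tail is already covered by the previous window.
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def search(query: str, chunks: List[str], top_k: int = 1) -> List[Tuple[float, str]]:
    """Return the top_k (score, chunk) pairs, highest similarity first."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


windows = chunk_text("abcdefghij", size=4, overlap=2)
docs = ["postgres stores vectors", "qdrant is a vector database", "pdf files need parsing"]
best = search("parsing pdf", docs)
print(windows, best[0][1])
```

In a real deployment the vectors would come from an embedding model and be stored in a vector index such as pgvector or Qdrant rather than compared in memory.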

Maintenance & Community

The project is active, with CI/CD pipelines for releases and a Discord community for support and discussion. Contributions are welcomed via a contributing guide.

Licensing & Compatibility

CocoIndex is licensed under the Apache 2.0 license, which permits commercial use and integration with closed-source projects.

Limitations & Caveats

The framework's primary dependency is PostgreSQL with pgvector, so environments that do not already run it must provision a PostgreSQL instance first. The README does not detail performance benchmarks or scalability limits.

Health Check

  • Last commit: 1 day ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 136
  • Issues (30d): 20

Star History

1,383 stars in the last 90 days

Explore Similar Projects

Starred by Jared Palmer (Ex-VP of AI at Vercel; Founder of Turborepo; Author of Formik, TSDX).

pgvector-node by pgvector

0.8%
399
Node.js library for pgvector support
created 4 years ago
updated 2 weeks ago