cosdata  by cosdata

Advanced AI data platform for next-gen search pipelines

Created 1 year ago
340 stars

Top 81.1% on SourcePulse

GitHubView on GitHub
Project Summary

Cosdata is a next-generation retrieval infrastructure designed for AI-native applications, addressing the need for relevance beyond simple vector similarity. It targets developers building advanced search pipelines, offering a relevance-first architecture that combines multiple search modalities to enhance AI project data management and retrieval quality while reducing compute requirements.

How It Works

Cosdata employs a relevance-first architecture that unifies multiple search modalities: BM25 full-text search, HNSW dense vectors, and SPLADE learned sparse embeddings. This multi-modal approach, combined with context-aware capabilities like geofencing and hierarchical organization, aims to optimize for actual user satisfaction rather than just mathematical proximity. Its enterprise-grade design features colocated storage, streaming ingestion, and transactional versioning for robust AI workloads.

Quick Start & Requirements

  • Installation:
    • Linux: curl -sL https://cosdata.io/install.sh | bash
    • macOS & Windows: Docker (v20.10+) - docker pull cosdataio/cosdata:latest then docker run ...
  • Build from Source: Requires Git (v2.0+), Rust (v1.81.0+), and a C++ compiler (GCC ≥ 4.8 or Clang ≥ 3.4).
  • Testing: Requires Python 3.8+ and the uv CLI.
  • HTTPS: Detailed instructions are provided for generating self-signed certificates and configuring TLS.
  • Documentation: Client SDK documentation links are provided.

Highlighted Details

  • Performance: Claims 60-120% reduction in compute requirements and 20-50% improvement in retrieval quality (NDCG@10). Offers up to 12x faster indexing than Elasticsearch and sub-100ms response times.
  • Multi-Modal Search: Seamlessly integrates BM25, HNSW dense vectors, and SPLADE sparse embeddings.
  • Enterprise-Grade: Features colocated storage, transactional versioning ("time travel"), streaming ingestion, and production-ready security (end-to-end encryption, RBAC).
  • Scalability: Engineered for unbounded scalability with predictable, near-linear query performance.
  • Optimization: Supports advanced vector quantization (scalar, product) and auto-configuration of hyperparameters.

Maintenance & Community

Contributions are welcomed via issues and pull requests, with guidelines in CONTRIBUTING.md. Community engagement is encouraged through Discord, email (contact@cosdata.io), and GitHub issues/discussions.

Licensing & Compatibility

The provided README does not explicitly state the software's license. This omission requires further investigation for compatibility, especially for commercial use or closed-source integration.

Limitations & Caveats

Documentation for testing with real-world datasets is incomplete, with placeholders for download links and configuration instructions. The default HTTP mode is insecure and not recommended for production environments.

Health Check
Last Commit

1 week ago

Responsiveness

Inactive

Pull Requests (30d)
2
Issues (30d)
0
Star History
108 stars in the last 30 days

Explore Similar Projects

Starred by Chang She Chang She(Cofounder of LanceDB), Carol Willing Carol Willing(Core Contributor to CPython, Jupyter), and
11 more.

lancedb by lancedb

0.6%
8k
Embedded retrieval engine for multimodal AI
Created 2 years ago
Updated 2 days ago
Feedback? Help us improve.