docarray by docarray

Python library for multimodal data representation, transmission, storage, and retrieval

created 3 years ago
3,084 stars

Top 15.9% on sourcepulse

Project Summary

DocArray is a Python library designed for the representation, transmission, storage, and retrieval of multimodal data, targeting developers of multimodal AI applications. It offers a flexible, Pydantic-based schema for defining data structures, enabling seamless integration with ML frameworks and web services, and simplifying data handling for training, serving, and parsing tasks.

How It Works

DocArray leverages Pydantic for its data modeling, allowing users to define custom schemas with ML-specific types like TorchTensor and ImageUrl, including tensor shape validation. It provides DocVec and DocList for efficient batch processing and data management, respectively. Data can be serialized to Protobuf or JSON for transmission via gRPC or HTTP, and integrated with various vector databases (Weaviate, Qdrant, etc.) for similarity search.
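The difference between the two containers can be illustrated without the library. The sketch below uses only the Python standard library and assumes the usual reading of the paragraph above: a list-style container holds one record per document, while a vectorized container holds one column per field for batch work. The names Doc, EMBEDDING_DIM, and to_columns are hypothetical stand-ins and are not DocArray's API; the length check only imitates the idea of tensor shape validation.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List

EMBEDDING_DIM = 3  # hypothetical fixed shape, chosen only for this illustration


@dataclass
class Doc:
    """Hypothetical stand-in for one schema-typed document (not DocArray's BaseDoc)."""

    text: str
    embedding: List[float]

    def __post_init__(self) -> None:
        # Reject embeddings whose length differs from the declared shape.
        if len(self.embedding) != EMBEDDING_DIM:
            raise ValueError(
                f"expected embedding of length {EMBEDDING_DIM}, got {len(self.embedding)}"
            )


def to_columns(docs: List[Doc]) -> Dict[str, list]:
    """Turn a row-oriented list of documents into one list per field."""
    return {
        "text": [d.text for d in docs],
        "embedding": [d.embedding for d in docs],
    }


# Row-oriented view: convenient for appending, filtering, and per-item access.
docs = [Doc("cat", [1.0, 0.0, 0.0]), Doc("dog", [0.0, 1.0, 0.0])]

# Column-oriented view: each field is contiguous, which suits batched model input.
columns = to_columns(docs)

# JSON round trip, as would be needed to send the documents over HTTP.
payload = json.dumps([asdict(d) for d in docs])
restored = [Doc(**item) for item in json.loads(payload)]
```

In this sketch the row view is the natural form for managing individual documents, and the column view is the natural form for handing a whole batch of embeddings to a model in one call.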

Quick Start & Requirements

  • Install via pip install -U docarray.
  • Supports NumPy, PyTorch, TensorFlow, and JAX.
  • Integrates with FastAPI, Jina, and multiple vector databases.
  • Official documentation: https://docarray.jina.ai/

Highlighted Details

  • Native support for major ML frameworks (PyTorch, TensorFlow, JAX, NumPy).
  • Pydantic-based schema definition with tensor shape validation.
  • DocVec and DocList for efficient batch processing and data management.
  • Seamless integration with FastAPI for model serving.
  • Support for vector databases like Weaviate, Qdrant, Elasticsearch, Redis, and HNSWLib for similarity search.
  • Data transmission via JSON over HTTP or Protobuf over gRPC.

Maintenance & Community

  • A sandbox project within the LF AI & Data Foundation.
  • Discord server available for community support.
  • Roadmap available for future development insights.

Licensing & Compatibility

  • Licensed under Apache License 2.0.
  • Apache 2.0 permits use in commercial and closed-source applications, subject to its notice and attribution requirements.

Limitations & Caveats

DocArray's API changed substantially after version 0.21. Code written against older versions (<=0.21) is not compatible with current releases, so users who depend on it must explicitly install the older version. The README also notes that TensorFlowTensor is not a subclass of tf.Tensor; direct TensorFlow operations must go through its .tensor attribute.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 37 stars in the last 90 days
