docarray by docarray

Python library for multimodal data representation, transmission, storage, and retrieval

Created 4 years ago

3,111 stars

Top 15.2% on SourcePulse

Project Summary

DocArray is a Python library designed for the representation, transmission, storage, and retrieval of multimodal data, targeting developers of multimodal AI applications. It offers a flexible, Pydantic-based schema for defining data structures, enabling seamless integration with ML frameworks and web services, and simplifying data handling for training, serving, and parsing tasks.

How It Works

DocArray builds its data model on Pydantic, so users define custom schemas as typed classes with ML-specific field types such as TorchTensor and ImageUrl, including tensor shape validation. Two container types cover collections of documents: DocList keeps documents as a list with each tensor left as is, which suits general data management, while DocVec stacks the tensors of all documents into a single tensor for efficient batch processing. Data can be serialized to Protobuf or JSON for transmission via gRPC or HTTP, and stored in vector databases (Weaviate, Qdrant, etc.) for similarity search.

Quick Start & Requirements

Install via pip install -U docarray.
Supports NumPy, PyTorch, TensorFlow, and JAX.
Integrates with FastAPI, Jina, and multiple vector databases.
Official documentation: https://docarray.jina.ai/

Highlighted Details

Native support for major ML frameworks (PyTorch, TensorFlow, JAX, NumPy).
Pydantic-based schema definition with tensor shape validation.
DocVec and DocList for efficient batch processing and data management.
Seamless integration with FastAPI for model serving.
Support for vector databases like Weaviate, Qdrant, Elasticsearch, Redis, and HNSWLib for similarity search.
Data transmission via JSON over HTTP or Protobuf over gRPC.

Maintenance & Community

A sandbox project within the LF AI & Data Foundation.
Discord server available for community support.
Roadmap available for future development insights.

Licensing & Compatibility

Licensed under Apache License 2.0.
The Apache 2.0 terms permit use in commercial and closed-source applications.

Limitations & Caveats

DocArray's API changed substantially after version 0.21. Projects written against 0.21 or earlier must pin that older release at install time to stay compatible. The README also notes that TensorFlowTensor is not a subclass of tf.Tensor, so native TensorFlow operations need the underlying tensor, which is exposed through the .tensor attribute.

Health Check

Last commit: 6 months ago
Responsiveness: 1 day
Star history: 4 stars in the last 30 days