Python library for multimodal data representation, transmission, storage, and retrieval
DocArray is a Python library designed for the representation, transmission, storage, and retrieval of multimodal data, targeting developers of multimodal AI applications. It offers a flexible, Pydantic-based schema for defining data structures, enabling seamless integration with ML frameworks and web services, and simplifying data handling for training, serving, and parsing tasks.
How It Works
DocArray builds its data modeling on Pydantic, so users can define custom schemas with ML-specific types such as TorchTensor and ImageUrl, including tensor shape validation. It provides DocVec for efficient batch processing and DocList for general data management. Data can be serialized to Protobuf or JSON for transmission over gRPC or HTTP, and stored in vector databases (Weaviate, Qdrant, etc.) for similarity search.
Quick Start & Requirements
pip install -U docarray
Highlighted Details
DocVec and DocList for efficient batch processing and data management.
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
DocArray's API changed significantly after version 0.21; users who depend on the older API (<=0.21) must explicitly install that older version to stay compatible. The README notes that TensorFlowTensor is not a subclass of tf.Tensor, so the underlying tensor must be accessed through the .tensor attribute for direct TensorFlow operations.