docarray  by docarray

Python library for multimodal data representation, transmission, storage, and retrieval

Created 3 years ago
3,101 stars

Top 15.4% on SourcePulse

Project Summary

DocArray is a Python library designed for the representation, transmission, storage, and retrieval of multimodal data, targeting developers of multimodal AI applications. It offers a flexible, Pydantic-based schema for defining data structures, enabling seamless integration with ML frameworks and web services, and simplifying data handling for training, serving, and parsing tasks.

How It Works

DocArray leverages Pydantic for its data modeling, allowing users to define custom schemas with ML-specific types like TorchTensor and ImageUrl, including tensor shape validation. It provides DocVec and DocList for efficient batch processing and data management, respectively. Data can be serialized to Protobuf or JSON for transmission via gRPC or HTTP, and integrated with various vector databases (Weaviate, Qdrant, etc.) for similarity search.

Quick Start & Requirements

  • Install via pip install -U docarray.
  • Supports NumPy, PyTorch, TensorFlow, and JAX.
  • Integrates with FastAPI, Jina, and multiple vector databases.
  • Official documentation: https://docarray.jina.ai/

Highlighted Details

  • Native support for major ML frameworks (PyTorch, TensorFlow, JAX, NumPy).
  • Pydantic-based schema definition with tensor shape validation.
  • DocVec and DocList for efficient batch processing and data management.
  • Seamless integration with FastAPI for model serving.
  • Support for vector databases like Weaviate, Qdrant, Elasticsearch, Redis, and HNSWLib for similarity search.
  • Data transmission via JSON over HTTP or Protobuf over gRPC.

Maintenance & Community

  • A sandbox project within the LF AI & Data Foundation.
  • Discord server available for community support.
  • Roadmap available for future development insights.

Licensing & Compatibility

  • Licensed under Apache License 2.0.
  • Apache 2.0 permits use in commercial and closed-source applications, subject to the license's notice and attribution terms.

Limitations & Caveats

Later DocArray releases changed the API significantly, so code written against older versions (<=0.21) is not compatible with current releases; such projects must pin the older version explicitly when installing. The README also notes that TensorFlowTensor is not a subclass of tf.Tensor, so direct TensorFlow operations must go through its .tensor attribute.

Health Check

  • Last commit: 3 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 9 stars in the last 30 days
