Python library for LLM-powered data cleaning and curation
Databonsai is a Python library that uses Large Language Models (LLMs) to clean and curate data. It provides tools for categorization, transformation, and extraction, aiming to simplify complex data manipulation for developers and data scientists.
How It Works
Databonsai performs data operations through LLMs while hiding the details of prompt engineering and API interaction. It supports providers such as OpenAI and Anthropic and emphasizes cost-effectiveness, recommending Anthropic's Haiku model. Key features include batch processing to save tokens, adaptive batch sizing for larger datasets, and built-in retry logic with exponential backoff to handle API errors and invalid responses, which makes data processing pipelines more reliable.
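The retry and adaptive batching ideas described above can be illustrated with a minimal, generic sketch. This is not Databonsai's implementation or API: the function names (call_with_backoff, process_in_adaptive_batches), their parameters, and the halving strategy are all assumptions chosen for illustration. The process_batch argument stands in for any function that sends a batch to an LLM and returns one result per input.

```python
import time


def call_with_backoff(fn, *args, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(*args); on an exception, wait base_delay * 2**attempt and retry.

    Hypothetical helper for illustration, not part of Databonsai.
    """
    for attempt in range(max_retries + 1):
        try:
            return fn(*args)
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted, surface the last error
            sleep(base_delay * 2 ** attempt)


def process_in_adaptive_batches(items, process_batch, start_size=8,
                                max_retries=2, base_delay=1.0, sleep=time.sleep):
    """Process items in batches, halving the batch size when a batch keeps failing.

    Hypothetical helper for illustration, not part of Databonsai.
    """
    def validated(batch):
        # Treat a response with the wrong number of results as invalid,
        # so it goes through the same retry path as an API error.
        out = process_batch(batch)
        if len(out) != len(batch):
            raise ValueError("response count does not match batch size")
        return out

    results = []
    position = 0
    size = start_size
    while position < len(items):
        batch = items[position:position + size]
        try:
            out = call_with_backoff(validated, batch, max_retries=max_retries,
                                    base_delay=base_delay, sleep=sleep)
        except Exception:
            if size == 1:
                raise  # a single item still fails, so give up
            size = max(1, size // 2)  # shrink the batch and try again
            continue
        results.extend(out)
        position += len(batch)
    return results
```

Passing sleep as a parameter keeps the sketch easy to test without real delays. A production version would likely also cap the maximum delay and add jitter.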
Quick Start & Requirements
pip install databonsai
.env file or arguments.
Highlighted Details
(apply_to_column_autobatch) for optimizing token usage and handling errors.
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
1 year ago
Inactive