databonsai by alvin-r

Python library for LLM-powered data cleaning and curation

created 1 year ago
492 stars

Top 63.6% on sourcepulse

Project Summary

Databonsai is a Python library that uses Large Language Models (LLMs) for data cleaning and curation. It provides tools for categorization, transformation, and extraction, so developers and data scientists can handle these tasks without writing their own prompts and API plumbing.

How It Works

Databonsai utilizes LLMs to perform data operations by abstracting away the complexities of prompt engineering and API interactions. It supports providers like OpenAI and Anthropic, with a focus on cost-effectiveness (recommending Anthropic's Haiku model). Key features include batch processing for token savings, adaptive batch sizing for larger datasets, and built-in retry logic with exponential backoff to handle API errors and invalid responses, ensuring more reliable data processing pipelines.
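The retry behavior described above can be illustrated with a minimal sketch. This is not databonsai's implementation: the function name call_with_retries and its parameters (call, validate, max_retries, base_delay, sleep) are hypothetical and chosen only to show how exponential backoff and response validation might fit together.

```python
import random
import time


def call_with_retries(call, validate, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call `call()` until `validate(result)` passes or retries run out.

    Hypothetical helper for illustration; not part of databonsai.
    """
    last_error = None
    for attempt in range(max_retries + 1):
        try:
            result = call()
            if validate(result):
                return result
            # A well-formed but invalid answer is treated like a failure.
            last_error = ValueError(f"invalid response: {result!r}")
        except Exception as exc:  # a real client would catch narrower API errors
            last_error = exc
        if attempt < max_retries:
            # Exponential backoff: base_delay, then 2x, 4x, ... plus up to 10% jitter.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, 0.1 * delay))
    raise RuntimeError(f"gave up after {max_retries + 1} attempts") from last_error
```

In this sketch, validate would check that the model's answer is usable, for example that a returned label is one of the allowed categories. The sleep parameter exists only so the delay can be replaced in tests.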

Quick Start & Requirements

  • Install via pip: pip install databonsai
  • Requires API keys for LLM providers (e.g., OpenAI, Anthropic), configurable via .env file or arguments.
  • Official documentation and examples are available.
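As a rough illustration of the "arguments or environment" pattern for API keys, the sketch below prefers an explicitly passed key and falls back to an environment variable. The helper load_api_key is hypothetical, and the variable name ANTHROPIC_API_KEY is an assumption based on the Anthropic SDK's usual convention. How databonsai itself reads keys should be checked against its documentation.

```python
import os


def load_api_key(name, explicit=None):
    """Return an explicitly passed key, else the value of environment variable `name`.

    Hypothetical helper for illustration; not part of databonsai.
    """
    key = explicit or os.environ.get(name)
    if not key:
        raise RuntimeError(f"{name} is not set; pass a key or export it")
    return key
```

A call such as load_api_key("ANTHROPIC_API_KEY") would then pick up a key exported in the shell. A tool such as python-dotenv can load a .env file into the environment before this runs.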

Highlighted Details

  • Supports data categorization, transformation, and structured data extraction.
  • Features adaptive batching (apply_to_column_autobatch) for optimizing token usage and handling errors.
  • Includes retry logic for API errors and response validation.
  • Tracks token usage for cost estimation with supported providers.
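The idea behind adaptive batching can be sketched as follows. This is an assumption-laden illustration, not the logic of apply_to_column_autobatch: the function apply_autobatch, the process_batch callback, and the grow-by-doubling and shrink-by-halving rules are all hypothetical.

```python
def apply_autobatch(items, process_batch, start_size=4, max_size=32):
    """Process `items` in batches, doubling the size on success and halving on failure.

    Hypothetical helper for illustration; not part of databonsai.
    `process_batch` must return one output per input item.
    """
    results = []
    size = start_size
    i = 0
    while i < len(items):
        batch = items[i:i + size]
        try:
            out = process_batch(batch)
            if len(out) != len(batch):
                raise ValueError("output count does not match input count")
        except Exception:
            if len(batch) == 1:
                raise  # even a single item fails, so surface the error
            size = max(1, size // 2)  # retry the same position with a smaller batch
            continue
        results.extend(out)
        i += len(batch)
        size = min(max_size, size * 2)
    return results
```

Larger batches amortize the prompt overhead across more rows, which is where the token savings come from. Shrinking on failure keeps one bad batch from stalling the whole column.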

Maintenance & Community

  • The README describes ongoing development of new tools and LLM provider integrations, but the last commit was about a year ago.
  • Community links or specific contributor details are not prominently featured in the README.

Licensing & Compatibility

  • The README does not explicitly state a license. Users should verify licensing for commercial use.

Limitations & Caveats

  • The library is primarily focused on LLM-based data cleaning; other data processing needs may require complementary tools.
  • Adaptive batching helps, but the optimal batch size still depends on the LLM model used and the complexity of the input.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 1
  • Star history: 6 stars in the last 90 days