chdb by chdb-io

In-process OLAP SQL engine for Python data analysis

Created 2 years ago
2,579 stars

Top 18.0% on SourcePulse

Project Summary

chDB is an in-process OLAP SQL engine built on ClickHouse that runs high-performance analytical queries directly inside Python applications. It targets data scientists, engineers, and researchers who need to query diverse data formats without setting up and managing a separate ClickHouse instance. The main benefits are simpler data analysis workflows, less data copying, and close integration with the Python data ecosystem.

How It Works

chDB embeds ClickHouse's core OLAP engine, so SQL queries run inside the Python process itself. It uses Python's memoryview to minimize data transfer overhead between C++ and Python. The engine reads and writes a wide range of formats, including Parquet, CSV, JSON, Arrow, and ORC, and complies with the Python DB API 2.0. Advanced features include User-Defined Functions (UDFs) and streaming query processing for large datasets. chDB also offers AI-assisted SQL generation, which translates natural language prompts into executable SQL queries.
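
To illustrate the memoryview mechanism mentioned above, here is a minimal sketch that uses only the Python standard library. It is not chDB's internal code: the bytearray stands in for a result buffer owned by native code, and all names and contents are made up for illustration.

```python
# Illustration only: stdlib memoryview semantics, not chDB internals.
# A bytearray stands in for a result buffer owned by native code.
result_buffer = bytearray(b"id,name\n1,alice\n2,bob\n")

# Wrapping the buffer in a memoryview does not copy the bytes.
view = memoryview(result_buffer)

# Slicing a memoryview yields another view onto the same memory.
header = view[:7]

# A change to the underlying buffer is visible through the view,
# which shows that both names refer to the same memory.
result_buffer[0] = ord("I")

# bytes() makes an explicit copy, taken here only to decode the header.
header_text = bytes(header).decode()
print(header_text)
```

The point of the sketch is that a consumer can read a slice of a large buffer without duplicating it; a copy is made only where the code asks for one.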

Quick Start & Requirements

  • Installation: pip install chdb
  • Prerequisites: Python 3.8+ on macOS and Linux (x86_64 and ARM64).
  • Usage: Can be run via the command line (python3 -m chdb "SQL" [OutputFormat]) or programmatically using the chdb Python API (chdb.connect(), chdb.query()).
  • Documentation: Examples are available in the repository (examples/, tests/), and project documentation is linked via the README.
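
As a rough sketch of the programmatic usage listed above, the snippet below wraps chdb.query() in a small helper. The helper name run_sql and its fallback behaviour are hypothetical; the import guard is there only so the sketch still loads on machines where chdb is not installed.

```python
# chdb is a third-party package (pip install chdb); the guard lets this
# sketch load even where it is not installed.
try:
    import chdb
except ImportError:
    chdb = None


def run_sql(sql, output_format="CSV"):
    """Run one SQL statement in-process and return the result as text.

    Hypothetical helper: returns None when chdb is unavailable.
    """
    if chdb is None:
        return None
    # chdb.query takes the SQL text and the name of an output format.
    result = chdb.query(sql, output_format)
    return str(result)


if __name__ == "__main__":
    print(run_sql("SELECT 1 AS answer"))
```

The same statement can be run from the shell with python3 -m chdb "SELECT 1 AS answer" CSV.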

Highlighted Details

  • In-process OLAP SQL engine powered by ClickHouse.
  • Supports 60+ data formats including Parquet, CSV, JSON, Arrow, and ORC.
  • Full Python DB API 2.0 compliance.
  • Seamless integration with Pandas DataFrames and PyArrow Tables.
  • Efficient streaming query processing for large datasets with constant memory usage.
  • User-Defined Functions (UDFs) for custom logic.
  • AI-assisted SQL generation from natural language prompts.
  • Benchmarks available comparing embedded engines and DataFrame performance.
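
For readers unfamiliar with what DB API 2.0 compliance implies, the sketch below shows the connection and cursor pattern defined by PEP 249. It deliberately uses the standard library's sqlite3 driver as a stand-in, not chDB; with chDB the connection would come from chDB's own DB API module, and details such as the parameter placeholder style can differ between drivers. The table and values are made up.

```python
import sqlite3  # stand-in driver; PEP 249 modules share this overall shape

# connect() returns a connection, and cursor() returns a cursor on it.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# execute() and executemany() send statements through the cursor.
cur.execute("CREATE TABLE events (user_id INTEGER, amount REAL)")
cur.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, 9.5), (1, 0.5), (2, 3.0)],
)

# Query results are read back with fetchone(), fetchmany(), or fetchall().
cur.execute(
    "SELECT user_id, SUM(amount) FROM events GROUP BY user_id ORDER BY user_id"
)
rows = cur.fetchall()

# cursor.description carries one entry per result column.
column_count = len(cur.description)

cur.close()
conn.close()
print(rows)
```

Code written against this pattern tends to need few changes when the underlying driver is swapped, which is the practical value of the compliance claim.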

Maintenance & Community

The project maintains an active community presence via Discord (https://discord.gg/D2Daa2fM5K) and Twitter (@chdb). Contributions are welcomed for testing, documentation, and code improvements. Bindings for other languages are also encouraged.

Licensing & Compatibility

chDB is released under the Apache 2.0 license. This permissive license allows for commercial use, modification, and distribution, including integration within closed-source applications.

Limitations & Caveats

chDB currently supports Python 3.8+ on macOS and Linux; Windows support is not explicitly mentioned. User-Defined Functions (UDFs) must be stateless. Streaming queries must be released explicitly, either by calling close() or by using a with statement, because an unreleased stream can block subsequent operations. AI-assisted SQL generation works only after an AI provider and its API key have been configured.
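
The resource-management caveat for streaming queries can be sketched as follows. This assumes that the chdb.session module, Session.send_query(), and the per-chunk rows_read() method behave as in the project's README examples; the helper name count_streamed_rows is hypothetical, and the import guard only keeps the sketch loadable where chdb is not installed.

```python
try:
    from chdb import session as chs
except ImportError:
    chs = None


def count_streamed_rows(n):
    """Stream n generated rows in chunks and count them.

    Hypothetical helper: returns None when chdb is unavailable.
    """
    if chs is None:
        return None
    sess = chs.Session()
    total = 0
    try:
        # The with statement releases the stream when the block exits,
        # so later queries on the session are not blocked by it.
        with sess.send_query(f"SELECT * FROM numbers({int(n)})", "CSV") as stream:
            for chunk in stream:
                total += chunk.rows_read()
    finally:
        sess.close()
    return total


if __name__ == "__main__":
    print(count_streamed_rows(200000))
```

Without the with statement, the equivalent is to call close() on the stream object once iteration is finished or abandoned.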

Health Check

  • Last commit: 3 days ago
  • Responsiveness: Inactive
  • Pull requests (30d): 17
  • Issues (30d): 14
  • Star history: 39 stars in the last 30 days
