llm-driven-data-engineering by DataExpert-io

LLM-driven data engineering examples and tutorials

Created 2 years ago

1,110 stars

Top 34.4% on SourcePulse

Project Summary

This repository collects examples and tutorials on LLM-driven data engineering for engineers and developers who want to apply large language models to data tasks. It offers guidance and code examples for building LLM-powered data pipelines and applications, with an emphasis on hands-on implementation and business value.

How It Works

The project is organized into daily modules, each covering one part of LLM-driven data engineering. It uses LangChain and LlamaIndex for LLM orchestration and retrieval-augmented generation (RAG). Topics include auto-generating SQL queries, creating business value with LLMs, and building custom RAG applications such as "ZachGPT."
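The SQL auto-generation idea can be sketched without any LLM library: render the table schema and the user's question into a prompt, send it to a model, and check that the reply is a read-only query. This is a minimal illustration, not the repository's code. The names build_sql_prompt, generate_sql, and fake_llm are hypothetical, as is the "matches" table; in the labs this role is played by LangChain and a real OpenAI model.

```python
from typing import Callable, Dict, List


def build_sql_prompt(schema: Dict[str, List[str]], question: str) -> str:
    """Render table schemas and the user's question into one prompt string."""
    tables = "\n".join(
        f"- {table}({', '.join(columns)})" for table, columns in schema.items()
    )
    return (
        "You write PostgreSQL queries.\n"
        f"Tables:\n{tables}\n"
        f"Question: {question}\n"
        "Return a single SELECT statement and nothing else."
    )


def generate_sql(
    schema: Dict[str, List[str]], question: str, llm: Callable[[str], str]
) -> str:
    """Ask the model for SQL and reject anything that is not a SELECT."""
    sql = llm(build_sql_prompt(schema, question)).strip()
    # Guardrail (an assumption of this sketch): only read-only queries pass.
    if not sql.lower().startswith("select"):
        raise ValueError(f"expected a SELECT statement, got: {sql!r}")
    return sql


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; always returns the same query."""
    return "SELECT COUNT(*) FROM matches;"


# Hypothetical schema, used only for illustration.
example_schema = {"matches": ["match_id", "map_name"]}
example_sql = generate_sql(example_schema, "How many matches are there?", fake_llm)
```

Passing the model in as a plain callable keeps the prompt-building and validation logic testable without an API key.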

Quick Start & Requirements

Install: uv sync (recommended) or pip install .
Prerequisites: OpenAI API key, Pinecone account and API key, PostgreSQL database.
Environment Variables: OPENAI_API_KEY, PINECONE_API_KEY, LANGCHAIN_DATABASE_URL (if not using local dump).
Data: halo_data_dump.dump file for local PostgreSQL setup.
Resources: Requires setting up a PostgreSQL instance and obtaining API keys.
Links: Day 1 Lecture, Day 1 Lab, Day 2 Lecture, Day 2 Lab, Day 3 Lecture, Day 4 Lecture, Day 4 Lab.
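Before running any lab, it can help to confirm that the environment variables listed above are set. The following is a small sketch using only the standard library. The helper name missing_env_vars is hypothetical, and treating LANGCHAIN_DATABASE_URL as optional reflects the "if not using local dump" note rather than anything the repository enforces.

```python
import os
from typing import List, Mapping, Optional

# Variable names come from the requirements list above.
REQUIRED_VARS = ("OPENAI_API_KEY", "PINECONE_API_KEY")
# Only needed when not restoring halo_data_dump.dump into a local PostgreSQL.
OPTIONAL_VARS = ("LANGCHAIN_DATABASE_URL",)


def missing_env_vars(environ: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the required variable names that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing required variables: " + ", ".join(missing))
    for name in OPTIONAL_VARS:
        if not os.environ.get(name):
            print(f"Optional variable not set: {name}")
```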

Highlighted Details

Demonstrates LLM-driven SQL query generation using LangChain.
Provides a practical guide to building RAG applications.
Focuses on extracting business value through LLM integration.
Includes lecture videos for each day and lab videos for Days 1, 2, and 4.
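The retrieval step behind a RAG application can be illustrated with a toy example: score each document against the question, keep the top matches, and place them in the prompt as context. This sketch swaps in bag-of-words cosine similarity for the learned embeddings and Pinecone index that the project relies on. All function names and sample documents here are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question: str, documents: List[str], k: int = 2) -> List[str]:
    """Return the k documents most similar to the question."""
    query = embed(question)
    ranked = sorted(documents, key=lambda d: cosine(query, embed(d)), reverse=True)
    return ranked[:k]


def build_rag_prompt(question: str, documents: List[str], k: int = 2) -> str:
    """Put the retrieved documents in front of the question as context."""
    context = "\n".join(f"- {doc}" for doc in retrieve(question, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Hypothetical documents, used only for illustration.
docs = [
    "Slowly changing dimensions track history in a table.",
    "Pinecone stores vector embeddings.",
    "PostgreSQL is a relational database.",
]
top = retrieve("Where are vector embeddings stored?", docs, k=1)
```

In a real application the prompt from build_rag_prompt would be sent to an LLM, and embed plus the sort would be replaced by an embedding model and a vector-store query.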

Maintenance & Community

The project highlights key individuals in the LLM data engineering space to follow, including Li Yin, Chip Huyen, and Zach Wilson. Specific community links (Discord/Slack) or roadmap details are not provided in the README.

Licensing & Compatibility

The repository does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project relies on external services requiring API keys (OpenAI, Pinecone), which may incur costs. Setup for local development requires familiarity with PostgreSQL and environment variable management. Some labs depend on other repositories not directly included.

Explore Similar Projects

pg_gpt by cloudquery

pgassistant by nexsol-technologies

OmniSQL by RUCKBReasoning

MindSQL by Mindinventory

XiYan-SQL by XGenerationLab

korvus by postgresml

mcp-alchemy by runekaagaard

rookie_text2data by jaguarliuu

sql-eval by defog-ai

trustfall by obi1kenobi

Spider2 by xlang-ai

vanna by vanna-ai