llm-driven-data-engineering  by DataExpert-io

LLM-driven data engineering examples and tutorials

Created 2 years ago
1,087 stars

Top 35.0% on SourcePulse

Project Summary

This repository collects lectures, labs, and code examples on LLM-driven data engineering, aimed at engineers and developers who want to apply large language models to data tasks. It shows how to build LLM-powered data pipelines and applications, with an emphasis on hands-on implementation and business value.

How It Works

The material is organized into daily modules, each covering one part of LLM-driven data engineering. It uses LangChain and LlamaIndex for LLM orchestration and retrieval-augmented generation (RAG). Topics include auto-generating SQL queries, creating business value with LLMs, and building custom RAG applications such as "ZachGPT."
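To make the SQL auto-generation idea concrete, here is a minimal sketch of the general pattern: put the schema and the question into a prompt, call a model, and accept only read-only output. It is not taken from the repository and does not use LangChain. The function names, the `matches` table, and the `fake_llm` stand-in are all hypothetical, chosen so the example runs offline.

```python
from typing import Callable


def build_sql_prompt(schema: str, question: str) -> str:
    """Combine a table schema and a natural-language question into one prompt."""
    return (
        "You are a PostgreSQL expert. Using only the tables below, "
        "write a single SELECT statement that answers the question.\n\n"
        f"Schema:\n{schema}\n\n"
        f"Question: {question}\n"
        "SQL:"
    )


def generate_sql(llm: Callable[[str], str], schema: str, question: str) -> str:
    """Ask the model for SQL and reject anything that is not a SELECT."""
    sql = llm(build_sql_prompt(schema, question)).strip()
    if not sql.lower().startswith("select"):
        raise ValueError(f"Refusing non-SELECT statement: {sql!r}")
    return sql


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call."""
    return "SELECT COUNT(*) FROM matches;"


if __name__ == "__main__":
    # Hypothetical schema, for illustration only.
    schema = "CREATE TABLE matches (match_id TEXT, map TEXT);"
    print(generate_sql(fake_llm, schema, "How many matches are there?"))
```

In a real setup, `fake_llm` would be replaced by a call to a hosted model, and the prefix check would be only a first line of defense. A read-only database role is the safer guard.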

Quick Start & Requirements

  • Install: uv sync (recommended) or pip install .
  • Prerequisites: OpenAI API key, Pinecone account and API key, PostgreSQL database.
  • Environment Variables: OPENAI_API_KEY, PINECONE_API_KEY, LANGCHAIN_DATABASE_URL (if not using local dump).
  • Data: halo_data_dump.dump file for local PostgreSQL setup.
  • Resources: Requires setting up a PostgreSQL instance and obtaining API keys.
  • Links: Day 1 Lecture, Day 1 Lab, Day 2 Lecture, Day 2 Lab, Day 3 Lecture, Day 4 Lecture, Day 4 Lab.
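Before running any lab, it can help to confirm that the environment variables listed above are set. The following is a small sketch of such a check. The function name `missing_env_vars` and the `use_local_dump` flag are assumptions made for illustration and are not part of the repository.

```python
import os
from typing import Mapping

# Names taken from the Quick Start list above.
REQUIRED = ("OPENAI_API_KEY", "PINECONE_API_KEY")
REMOTE_DB_VAR = "LANGCHAIN_DATABASE_URL"  # only needed without the local dump


def missing_env_vars(env: Mapping[str, str], use_local_dump: bool = True) -> list[str]:
    """Return the names of required variables that are unset or empty."""
    names = list(REQUIRED)
    if not use_local_dump:
        names.append(REMOTE_DB_VAR)
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars(os.environ, use_local_dump=True)
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All required environment variables are set.")
```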

Highlighted Details

  • Demonstrates LLM-driven SQL query generation using LangChain.
  • Provides a practical guide to building RAG applications.
  • Focuses on extracting business value through LLM integration.
  • Includes lecture videos for each day and lab videos for Days 1, 2, and 4 (per the Links list above).
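The retrieval step at the heart of a RAG application can be sketched without any external service. The example below is an illustration under stated assumptions, not the repository's implementation. It swaps in a toy word-count "embedding" and in-memory cosine similarity where a real application would use an embedding model and a vector store such as Pinecone. All function names are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a real app would call a model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the question."""
    q = embed(question)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


def build_rag_prompt(question: str, documents: list[str], k: int = 2) -> str:
    """Place the retrieved documents in the prompt ahead of the question."""
    context = "\n".join(f"- {doc}" for doc in retrieve(question, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


if __name__ == "__main__":
    docs = [
        "data pipelines move data between systems",
        "cats sleep all day",
        "pipelines need monitoring",
    ]
    print(build_rag_prompt("how do data pipelines work", docs))
```

The resulting prompt would then be sent to a language model. Keeping retrieval separate from generation makes it possible to test ranking quality on its own.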

Maintenance & Community

The project highlights key individuals in the LLM data engineering space to follow, including Li Yin, Chip Huyen, and Zach Wilson. Specific community links (Discord/Slack) or roadmap details are not provided in the README.

Licensing & Compatibility

The repository does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project relies on external services requiring API keys (OpenAI, Pinecone), which may incur costs. Setup for local development requires familiarity with PostgreSQL and environment variable management. Some labs depend on other repositories not directly included.

Health Check

  • Last Commit: 10 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 14 stars in the last 30 days
