llm-driven-data-engineering  by DataExpert-io

LLM-driven data engineering examples and tutorials

created 1 year ago
1,061 stars

Top 36.2% on sourcepulse

Project Summary

This repository collects examples and tutorials on LLM-driven data engineering for engineers and developers who want to apply large language models to data tasks. It offers guidance and code examples for building LLM-powered data pipelines and applications, with an emphasis on hands-on implementation and business value.

How It Works

The project breaks LLM-driven data engineering into daily modules. It integrates popular libraries such as LangChain and LlamaIndex for LLM orchestration and retrieval-augmented generation (RAG). Key topics include auto-generating SQL queries, creating business value with LLMs, and building custom RAG applications such as "ZachGPT."
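To make the SQL auto-generation idea concrete, here is a minimal sketch of the prompt-assembly step that typically precedes a model call. It is not taken from the repository: the function name `build_sql_prompt`, the prompt wording, and the example schema are all hypothetical, and the actual model call (which the project reportedly handles through LangChain) is deliberately left out.

```python
def build_sql_prompt(schema: dict[str, list[str]], question: str) -> str:
    """Assemble a text-to-SQL prompt from a table schema and a question.

    Hypothetical helper for illustration; not part of the repository.
    """
    # Render each table as "- table(col1, col2)" so the model sees which
    # tables and columns it is allowed to reference.
    tables = "\n".join(
        f"- {name}({', '.join(columns)})"
        for name, columns in sorted(schema.items())
    )
    return (
        "You are a PostgreSQL expert. Using only the tables below, "
        "write one SQL query that answers the question.\n\n"
        f"Tables:\n{tables}\n\n"
        f"Question: {question}\n"
        "SQL:"
    )


# Hypothetical schema; the real tables in the course database may differ.
EXAMPLE_SCHEMA = {
    "players": ["player_id", "gamertag"],
    "matches": ["match_id", "player_id", "kills"],
}

prompt = build_sql_prompt(EXAMPLE_SCHEMA, "Which player has the most kills?")
print(prompt)
```

The resulting string would then be sent to a chat or completion model, and the returned SQL should be reviewed or run against a read-only connection before it is trusted.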

Quick Start & Requirements

  • Install: uv sync (recommended) or pip install .
  • Prerequisites: OpenAI API key, Pinecone account and API key, PostgreSQL database.
  • Environment Variables: OPENAI_API_KEY, PINECONE_API_KEY, LANGCHAIN_DATABASE_URL (if not using local dump).
  • Data: halo_data_dump.dump file for local PostgreSQL setup.
  • Resources: Requires setting up a PostgreSQL instance and obtaining API keys.
  • Links: Day 1 Lecture, Day 1 Lab, Day 2 Lecture, Day 2 Lab, Day 3 Lecture, Day 4 Lecture, Day 4 Lab.
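Because the setup depends on several environment variables, a small pre-flight check can catch a missing key before any lab code runs. The sketch below uses only the variable names listed above; the helper name `missing_env_vars` is hypothetical, and treating `LANGCHAIN_DATABASE_URL` as optional follows the "if not using local dump" note rather than anything confirmed in the repository.

```python
import os

# Variable names come from the requirements list above.
REQUIRED_VARS = ("OPENAI_API_KEY", "PINECONE_API_KEY")
# Only needed when not restoring the local dump (assumption based on the note above).
OPTIONAL_VARS = ("LANGCHAIN_DATABASE_URL",)


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing required environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```

Passing a plain dictionary instead of relying on `os.environ` makes the check easy to exercise in tests.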

Highlighted Details

  • Demonstrates LLM-driven SQL query generation using LangChain.
  • Provides a practical guide to building RAG applications.
  • Focuses on extracting business value through LLM integration.
  • Includes lecture videos for Days 1 through 4 and lab videos for Days 1, 2, and 4.
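The retrieval step at the heart of a RAG application can be illustrated without any external service. The sketch below is a toy stand-in, not the repository's method: it swaps real embeddings and a vector store such as Pinecone for bag-of-words counts and cosine similarity, and the names `embed`, `cosine`, `retrieve`, and `build_rag_prompt` are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding': lowercase word counts instead of a learned vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, documents, k=2):
    """Return the k documents most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(
        documents, key=lambda doc: cosine(query_vec, embed(doc)), reverse=True
    )
    return ranked[:k]


def build_rag_prompt(query, documents, k=2):
    """Place the retrieved documents in front of the question."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, k))
    return (
        f"Answer using only this context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
```

In a real application, `embed` would call an embedding model and `retrieve` would query a vector index; the prompt-building step stays essentially the same.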

Maintenance & Community

The project highlights key individuals in the LLM data engineering space to follow, including Li Yin, Chip Huyen, and Zach Wilson. Specific community links (Discord/Slack) or roadmap details are not provided in the README.

Licensing & Compatibility

The repository does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project relies on external services requiring API keys (OpenAI, Pinecone), which may incur costs. Setup for local development requires familiarity with PostgreSQL and environment variable management. Some labs depend on other repositories not directly included.

Health Check

  • Last commit: 9 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 61 stars in the last 90 days
