streamforge-ai  by cchenax

Real-time AI data pipeline for ingestion and feature generation

Created 2 months ago
309 stars

Top 86.7% on SourcePulse

Project Summary

StreamForge AI offers an open-source, real-time data pipeline platform tailored for AI and analytics workloads. It addresses the need for efficient Change Data Capture (CDC) ingestion, stream processing for feature generation, object storage integration, and optimized data prefetching for machine learning tasks. The platform targets engineers and researchers seeking a minimal, realistic, and locally deployable environment to explore modern data pipeline architectures and best practices.

How It Works

The architecture is composed of three layers plus a prefetch engine. The ingestion layer uses Debezium to capture row-level changes from MySQL/Postgres and publish them to Kafka. The streaming layer uses Apache Flink to consume the CDC events, apply transformations, compute aggregations, and write the processed data to storage. The storage layer initially targets MinIO/S3-compatible object storage, with Iceberg table sinks planned. A specialized prefetch engine analyzes access patterns to proactively load data into a hot cache, mitigating cold starts for ML workloads. The modular design builds on widely adopted streaming and messaging technologies.
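To make the streaming step concrete, here is a minimal sketch in plain Python of turning Debezium change events into a running feature aggregate. It is not the project's Flink job. The `before`/`after`/`op` envelope fields follow Debezium's documented event format, while the `user_id` and `amount` columns, the `unwrap` helper, and the `SpendPerUser` class are hypothetical names chosen for illustration. The sketch also assumes `amount` arrives as a plain JSON number.

```python
import json
from collections import defaultdict


def unwrap(raw):
    """Parse one Kafka message value and return the Debezium change envelope.

    Returns None for tombstones (null values). When the JSON converter embeds
    schemas, the envelope sits under a top-level "payload" key.
    """
    if raw is None:
        return None
    msg = json.loads(raw)
    if msg is None:
        return None
    return msg.get("payload", msg)


class SpendPerUser:
    """Hypothetical feature: running total of `amount` per `user_id`."""

    def __init__(self):
        self.totals = defaultdict(float)

    def apply(self, raw):
        envelope = unwrap(raw)
        if envelope is None:
            return
        op = envelope.get("op")
        before = envelope.get("before")
        after = envelope.get("after")
        # Updates and deletes retract the old row image first.
        if op in ("u", "d") and before:
            self.totals[before["user_id"]] -= before["amount"]
        # Creates, snapshot reads, and updates add the new row image.
        if op in ("c", "r", "u") and after:
            self.totals[after["user_id"]] += after["amount"]


if __name__ == "__main__":
    feature = SpendPerUser()
    feature.apply(json.dumps(
        {"op": "c", "before": None, "after": {"id": 1, "user_id": 7, "amount": 10.0}}))
    feature.apply(json.dumps(
        {"op": "u",
         "before": {"id": 1, "user_id": 7, "amount": 10.0},
         "after": {"id": 1, "user_id": 7, "amount": 12.0}}))
    print(dict(feature.totals))
```

Retracting the `before` image on updates and deletes keeps the aggregate correct when rows change, which is the main difference between aggregating a CDC stream and aggregating an append-only log.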

Quick Start & Requirements

A runnable local stack for an end-to-end demo (MySQL -> Debezium -> Kafka -> Flink -> MinIO) is available via Docker Compose, defined in deploy/cdc-flink-minio-demo/docker-compose.yml. Prerequisites are Docker and Docker Compose. The demo uses MySQL as its source database.
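As a rough illustration of bringing the demo stack up, the sketch below builds and optionally runs a Compose command from Python. It assumes the Docker Compose v2 plugin (`docker compose`) and that the script runs from the repository root. The helper names `compose_command` and `start_demo` are hypothetical, and the demo may need further steps that this summary does not describe.

```python
import shlex
import subprocess

# Path taken from the summary above, relative to the repository root.
COMPOSE_FILE = "deploy/cdc-flink-minio-demo/docker-compose.yml"


def compose_command(*action, compose_file=COMPOSE_FILE):
    """Build a `docker compose` argument list for the demo stack."""
    return ["docker", "compose", "-f", compose_file, *action]


def start_demo():
    """Start the stack in the background; raises if the command fails."""
    subprocess.run(compose_command("up", "-d"), check=True)


if __name__ == "__main__":
    # Print the command rather than running it, so the sketch is safe to try.
    print(shlex.join(compose_command("up", "-d")))
```

Calling `compose_command("down")` yields the matching teardown command.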

Highlighted Details

  • End-to-end demonstration of a CDC ingestion pipeline from MySQL to Flink processing and MinIO/S3 storage.
  • A novel storage-aware prefetching engine designed to optimize ML training cold starts by anticipating data access patterns.
  • Modular architecture leveraging Debezium, Apache Flink, and MinIO/S3, with planned Iceberg sink support.
  • Progressive roadmap towards lakehouse features, platformization, and deeper ML/AI integrations.
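The prefetching idea in the list above can be sketched as follows. This is an assumed design, not the project's engine: it learns which object key tends to follow which, and after each read it loads the most likely next key into a small LRU hot cache. The `PrefetchCache` class and the `fetch` callable (a stand-in for an object storage read) are hypothetical.

```python
from collections import Counter, OrderedDict, defaultdict


class PrefetchCache:
    """LRU hot cache with a first-order 'what usually comes next' predictor."""

    def __init__(self, fetch, capacity=2):
        self.fetch = fetch                      # callable: key -> value
        self.capacity = capacity
        self.cache = OrderedDict()              # oldest entry first
        self.successors = defaultdict(Counter)  # key -> counts of following keys
        self.last_key = None
        self.hits = 0
        self.misses = 0

    def _put(self, key, value):
        self.cache[key] = value
        self.cache.move_to_end(key)
        while len(self.cache) > self.capacity:
            self.cache.popitem(last=False)      # evict least recently used

    def get(self, key):
        # Record the observed transition last_key -> key.
        if self.last_key is not None:
            self.successors[self.last_key][key] += 1
        self.last_key = key

        if key in self.cache:
            self.hits += 1
            self.cache.move_to_end(key)
            value = self.cache[key]
        else:
            self.misses += 1
            value = self.fetch(key)
            self._put(key, value)

        # Prefetch the most frequently observed successor, if any.
        followers = self.successors[key]
        if followers:
            likely_next = followers.most_common(1)[0][0]
            if likely_next not in self.cache:
                self._put(likely_next, self.fetch(likely_next))
        return value


if __name__ == "__main__":
    cache = PrefetchCache(fetch=lambda k: f"data:{k}", capacity=2)
    for key in ["a", "b", "c", "a", "b", "c"]:
        cache.get(key)
    print(cache.hits, cache.misses)
```

With a cache of two entries and a cyclic scan over three keys, plain LRU would miss on every read. In this sketch the first pass learns the order, and on the second pass the reads of `b` and `c` are served from the cache because each was prefetched after its predecessor was read.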

Maintenance & Community

The project provides a contribution guide (CONTRIBUTING.md) and links to GitHub issues for tracking progress and opening discussions. The roadmap is detailed, outlining planned features up to v0.8. Relevant links include: GitHub issues (https://github.com/cchenax/streamforge-ai/issues), GitHub projects (https://github.com/cchenax/streamforge-ai/projects), new issue creation (https://github.com/cchenax/streamforge-ai/issues/new/choose), and open PRs (https://github.com/cchenax/streamforge-ai/pulls).

Licensing & Compatibility

The README does not explicitly state the project's license. Users should verify licensing terms before adoption, especially concerning commercial use or integration with closed-source systems.

Limitations & Caveats

The project explicitly states that it is not a full production-grade multi-tenant platform and that v0.1 does not include enterprise authentication/authorization. The current focus is on local development and demo environments; advanced features such as Kubernetes operators, fine-grained RBAC, and enterprise readiness are planned for later in the roadmap (v0.8).

Health Check

Last Commit: 1 day ago
Responsiveness: Inactive
Pull Requests (30d): 20
Issues (30d): 29
Star History: 185 stars in the last 30 days
