data-analytics-golden-demo  by GoogleCloudPlatform

End-to-end data analytics demo on Google Cloud, pre-configured and ready to run

created 3 years ago
259 stars

Top 98.4% on sourcepulse

Project Summary

This repository provides an end-to-end demonstration of Google Cloud's data analytics stack for engineers and data professionals who need to understand, or showcase, how its services integrate. It offers a fully configured environment with 70 to 700 million rows of data to illustrate real-world performance and scalability, and lets users explore different data processing and analysis paths.

How It Works

The system orchestrates data pipelines with Airflow, and its services communicate over private IP addresses. It draws on a broad set of Google Cloud services, including BigQuery for analytics, Dataplex for data lake management, BigQuery ML (BQML) for machine learning, and BigLake for accessing data across clouds (AWS, Azure). Recent updates include migrating from text-bison to Gemini Pro and replacing App Engine with Cloud Run.

Quick Start & Requirements

  • Installation: Clone the repository and run deployment scripts (deploy.sh or deploy-use-existing-project-non-org-admin.sh).
  • Prerequisites: Google Cloud account with appropriate IAM roles (Organization Administrator or Project Owner), Google Cloud CLI, Terraform, Git, curl, and jq. Deployment may require disabling specific organization policies.
  • Resources: Deployment can incur costs; running costs are estimated at $2/day without Composer, or $20/day with Composer.
  • Documentation: Demo Artifacts Table of Contents
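
Before running either deployment script, it can help to confirm that the command-line prerequisites are installed and to get a rough sense of running cost. The following is a minimal sketch, not part of the repository: the executable names (for example `gcloud` for the Google Cloud CLI) and the helper names `missing_tools` and `estimated_running_cost` are assumptions made for illustration, and the daily rates are simply the estimates quoted above.

```python
import shutil

# Executable names assumed for the tools listed under Prerequisites.
REQUIRED_TOOLS = ["gcloud", "terraform", "git", "curl", "jq"]

# Daily running-cost estimates quoted in the Resources bullet (USD).
DAILY_COST_USD = {"without_composer": 2, "with_composer": 20}


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH (hypothetical helper)."""
    return [tool for tool in tools if which(tool) is None]


def estimated_running_cost(days, with_composer):
    """Multiply the quoted daily estimate by a number of days (hypothetical helper)."""
    key = "with_composer" if with_composer else "without_composer"
    return days * DAILY_COST_USD[key]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools: " + ", ".join(missing))
    else:
        print("All prerequisite tools found on PATH.")
    print("Estimated 30-day cost with Composer (USD):", estimated_running_cost(30, True))
```

This only checks that the tools are on PATH; it does not verify versions, IAM roles, or organization policies, which still need to be checked against the repository's own documentation.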

Highlighted Details

  • Demonstrates AI Lakehouse concepts with Gemini LLMs integrated directly into BigQuery for customer review analysis.
  • Features cross-cloud data querying and joins using BigQuery Omni with AWS and Azure.
  • Includes extensive examples of BigQuery features like Materialized Views, JSON data types, BQML, BigSearch, BigLake, and Dataform.
  • Showcases data ingestion and processing at scale, including streaming data and handling 5-50 billion rows.
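
To make the first highlight more concrete, here is a hedged sketch of what calling an LLM from inside BigQuery for review analysis can look like. It only builds a SQL string around BigQuery's `ML.GENERATE_TEXT` table function. The function name `build_review_sentiment_query`, the model and table identifiers, the `review_text` column, and the prompt wording are all hypothetical and are not taken from the repository, whose own queries may differ.

```python
def build_review_sentiment_query(model, table, text_column="review_text"):
    """Build a BigQuery SQL string that asks a remote LLM model to label reviews.

    All identifiers passed in are placeholders chosen by the caller; nothing
    here is copied from the demo repository.
    """
    prompt_prefix = (
        "Classify the sentiment of this customer review as "
        "Positive, Negative, or Neutral: "
    )
    return f"""
SELECT
  {text_column},
  ml_generate_text_llm_result AS sentiment
FROM ML.GENERATE_TEXT(
  MODEL `{model}`,
  (
    SELECT
      {text_column},
      CONCAT('{prompt_prefix}', {text_column}) AS prompt
    FROM `{table}`
  ),
  STRUCT(0.0 AS temperature, 16 AS max_output_tokens, TRUE AS flatten_json_output)
)
""".strip()


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    print(build_review_sentiment_query(
        model="my-project.my_dataset.gemini_model",
        table="my-project.my_dataset.customer_reviews",
    ))
```

The resulting string could be submitted through any BigQuery client or pasted into the console, provided a remote model over a Gemini endpoint already exists in the project. Building the query as plain text keeps the sketch free of client-library dependencies.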

Maintenance & Community

This repository is published under the GoogleCloudPlatform organization on GitHub, which suggests maintenance and support from Google. It also links to further demos for related technologies, such as Chocolate AI and Data Beans.

Licensing & Compatibility

The repository is released under the Apache License 2.0, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

Deployment requires significant Google Cloud administrative privileges and can be complex due to organization policy configurations. Some features, like BigSearch on 50 billion rows, are noted as internal. Cloud Shell deployment is noted as potentially problematic.

Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 5

Star History

  • 16 stars in the last 90 days
