dataset-generator by metabase

AI dataset generator for realistic data

created 1 month ago
651 stars

Top 52.2% on sourcepulse

Project Summary

This project is an AI-powered tool that generates realistic datasets for demos, learning, and dashboards, aimed at developers and data analysts. A conversational prompt builder defines the data, and a Metabase integration lets users explore the result immediately. Only the initial preview incurs a (low) OpenAI API cost; CSV and SQL exports after that are free.

How It Works

The generator uses OpenAI's GPT-4o to interpret the user's prompt and produce a detailed data specification (schema and business rules). The data rows themselves are then generated locally with the Faker library from that LLM-generated spec. As a result, only the initial preview (the schema definition step) incurs OpenAI costs; subsequent exports run locally and add no further API cost.
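The two-stage flow described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the `DatasetSpec` shape, the column kinds, and `generateRows` are hypothetical names, and a small seeded random number generator stands in for the Faker library so the example has no dependencies. The spec is hard-coded here where the real tool would obtain it from the LLM.

```typescript
// Hypothetical spec shape; the project's real spec format is not documented here.
type ColumnKind = "id" | "name" | "amount" | "date";

interface ColumnSpec {
  name: string;
  kind: ColumnKind;
  min?: number; // only used by "amount" columns
  max?: number; // only used by "amount" columns
}

interface DatasetSpec {
  table: string;
  columns: ColumnSpec[];
}

type Row = Record<string, string | number>;

// Small deterministic PRNG (mulberry32 variant), standing in for a seeded Faker instance.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

const FIRST_NAMES = ["Ada", "Grace", "Linus", "Margaret"];
const LAST_NAMES = ["Lovelace", "Hopper", "Torvalds", "Hamilton"];

// Stage 2: expand a spec into rows locally, with no API calls.
function generateRows(spec: DatasetSpec, rowCount: number, seed = 42): Row[] {
  const rand = mulberry32(seed);
  const pick = (items: string[]) => items[Math.floor(rand() * items.length)];
  const rows: Row[] = [];
  for (let i = 0; i < rowCount; i++) {
    const row: Row = {};
    for (const col of spec.columns) {
      switch (col.kind) {
        case "id":
          row[col.name] = i + 1;
          break;
        case "name":
          row[col.name] = `${pick(FIRST_NAMES)} ${pick(LAST_NAMES)}`;
          break;
        case "amount": {
          const min = col.min ?? 0;
          const max = col.max ?? 100;
          // Round to two decimal places.
          row[col.name] = Math.round((min + rand() * (max - min)) * 100) / 100;
          break;
        }
        case "date": {
          // Random day in 2024 (a leap year), formatted as YYYY-MM-DD.
          const day = Math.floor(rand() * 366);
          row[col.name] = new Date(Date.UTC(2024, 0, 1 + day)).toISOString().slice(0, 10);
          break;
        }
      }
    }
    rows.push(row);
  }
  return rows;
}

// Stage 1 (assumed): in the real tool this spec would come from the LLM.
const spec: DatasetSpec = {
  table: "orders",
  columns: [
    { name: "order_id", kind: "id" },
    { name: "customer", kind: "name" },
    { name: "total", kind: "amount", min: 5, max: 500 },
    { name: "ordered_on", kind: "date" },
  ],
};

const preview = generateRows(spec, 5);       // small preview
const fullExport = generateRows(spec, 1000); // larger export from the same spec
console.log(preview);
```

Because the spec is reused and the generator is seeded, the larger export in this sketch starts with the same rows as the preview, and producing it requires no further LLM call.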

Quick Start & Requirements

  • Prerequisites: Docker and an OpenAI API key.
  • Setup: clone the repo, copy .env.example to .env.local, and add the OpenAI API key to .env.local.
  • Install dependencies with npm install, then start the app with npm run dev.
  • The application runs at http://localhost:3000.
  • Metabase is launched on demand via Docker.
  • Official Docs: https://github.com/metabase/dataset-generator

Highlighted Details

  • Conversational prompt builder for defining business type, schema, and row count.
  • Real-time data preview in the browser.
  • Exports data as CSV (single or multi-table ZIP) or SQL inserts.
  • One-click Metabase launch for data exploration.
  • Supports "One Big Table" (OBT) and "Star Schema" data structures.
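The CSV and SQL export formats listed above can be illustrated with a short sketch. The function names (`toCsv`, `toSqlInserts`) and the row shape are assumptions made for illustration and are not taken from the project. The sketch quotes CSV cells in the RFC 4180 style and doubles single quotes in SQL string literals; it assumes table and column names are trusted identifiers and leaves them unquoted.

```typescript
type Cell = string | number | null;
type Row = Record<string, Cell>;

// Quote a CSV cell only when it contains a comma, double quote, or line break.
function csvCell(value: Cell): string {
  if (value === null) return "";
  const text = String(value);
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Serialize rows to CSV, using the first row's keys as the header.
function toCsv(rows: Row[]): string {
  if (rows.length === 0) return "";
  const headers = Object.keys(rows[0]);
  const lines = [headers.map(csvCell).join(",")];
  for (const row of rows) {
    lines.push(headers.map((h) => csvCell(row[h] ?? null)).join(","));
  }
  return lines.join("\n");
}

// Render a value as a SQL literal, escaping single quotes by doubling them.
function sqlLiteral(value: Cell): string {
  if (value === null) return "NULL";
  if (typeof value === "number") return String(value);
  return `'${value.replace(/'/g, "''")}'`;
}

// Emit one INSERT statement per row (identifiers are assumed to be trusted).
function toSqlInserts(table: string, rows: Row[]): string {
  return rows
    .map((row) => {
      const cols = Object.keys(row);
      const values = cols.map((c) => sqlLiteral(row[c]));
      return `INSERT INTO ${table} (${cols.join(", ")}) VALUES (${values.join(", ")});`;
    })
    .join("\n");
}

const rows: Row[] = [
  { id: 1, customer: "O'Brien, Pat", total: 19.5 },
  { id: 2, customer: "Ada", total: null },
];

const csv = toCsv(rows);
const sql = toSqlInserts("orders", rows);
console.log(csv);
console.log(sql);
```

A multi-table export would presumably apply the same serialization once per table before bundling the files into a ZIP; that step is omitted here.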

Maintenance & Community

The project is maintained by Metabase. Further community interaction details are not specified in the README.

Licensing & Compatibility

The project is licensed under the MIT License, permitting commercial use and integration with closed-source applications.

Limitations & Caveats

The generator requires an OpenAI API key, and each data preview incurs API costs. Exports are free, but the quality and realism of the generated data depend on how well the LLM interprets the prompt and on what the Faker library can produce.

Health Check

  • Last commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 2
  • Issues (30d): 1

Star History

654 stars in the last 90 days
