DataFlow-EDU  by Heartune

Generate exercise sets from textbooks

Created 2 months ago
304 stars

Top 87.7% on SourcePulse

Project Summary

DataFlow-EDU automatically generates structured educational question banks and benchmarks from PDF textbooks. It targets educators and researchers with an end-to-end, operator-based pipeline that turns raw teaching materials into datasets for training and evaluating large language models. The main benefit is automating a labor-intensive process, which makes it practical to create domain-specific educational content at scale.

How It Works

The project uses an operator-and-pipeline architecture, inspired by DataFlow and PyTorch, for semi-automatic educational dataset generation. It ingests PDF documents and uses MinerU for multimodal document parsing and OCR. Subsequent operators handle content slicing, question generation, dynamic balancing of question types, and multi-stage cleaning (ambiguity and domain refinement). An LLM-as-a-Judge step reviews quality along multiple dimensions, with the aim of keeping hallucination low and the question-type distribution balanced. The pipeline is modular and interactive from the command line, so users can monitor it and intervene, and a Vue.js-based WebUI is available for management.
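To make the operator-and-pipeline idea concrete, here is a minimal Python sketch. It is not the project's code: the names Pipeline, Question, slice_content, generate_questions, and balance_types are hypothetical, the question generator is a deterministic stand-in for an LLM call, and the balancing rule (a simple per-type cap) is an assumption chosen only to illustrate the pattern of chaining small operators.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Question:
    text: str
    qtype: str  # hypothetical type labels, e.g. "choice" or "short_answer"


class Pipeline:
    """Runs a sequence of operators, feeding each one the previous output."""

    def __init__(self, operators: Iterable[Callable[[list], list]]):
        self.operators = list(operators)

    def run(self, data: list) -> list:
        for op in self.operators:
            data = op(data)
        return data


def slice_content(pages: List[str]) -> List[str]:
    """Split parsed page text into non-empty paragraph chunks."""
    return [c.strip() for page in pages for c in page.split("\n\n") if c.strip()]


def generate_questions(chunks: List[str]) -> List[Question]:
    """Stand-in for an LLM call: one question per chunk, alternating types."""
    types = ("choice", "short_answer")
    return [Question(f"Question about: {c}", types[i % 2]) for i, c in enumerate(chunks)]


def balance_types(questions: List[Question], max_per_type: int = 2) -> List[Question]:
    """Keep at most max_per_type questions of each type, preserving order."""
    counts: Counter = Counter()
    kept = []
    for q in questions:
        if counts[q.qtype] < max_per_type:
            counts[q.qtype] += 1
            kept.append(q)
    return kept


if __name__ == "__main__":
    pipeline = Pipeline([slice_content, generate_questions, balance_types])
    for q in pipeline.run(["A\n\nB\n\nC", "D\n\nE"]):
        print(q.qtype, "|", q.text)
```

Because each operator only maps a list to a list, a stage such as cleaning or LLM-based review could be added or swapped without touching the others, which is the property the modular design relies on.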

Quick Start & Requirements

  • Primary install / run command: Install the local DataFlow package with pip install -e . in the DataFlow directory, then run the pipeline with python -m dataflow_edu.edu_data_pipeline from the project root.
  • Non-default prerequisites: Local DataFlow package, LLM configuration (.llm_config.json), Node.js for WebUI.
  • Estimated setup time or resource footprint: Not specified.
  • Links: GitHub Repo: https://github.com/Heartune/DataFlow-EDU. Demo slides are located in slide-deck/dataflow-edu/. WebUI default access: http://127.0.0.1:5173 (local).
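The prerequisites listed above can be checked before a first run with a small preflight script. This is a hedged sketch, not part of the project: the function name missing_prerequisites is hypothetical, and it assumes that .llm_config.json sits in the project root and that the Node.js executable is named node, neither of which the summary confirms.

```python
import shutil
from pathlib import Path
from typing import Callable, List, Optional


def missing_prerequisites(
    project_root: str,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the names of prerequisites that could not be found."""
    missing = []
    # Assumption: the LLM configuration file lives in the project root.
    if not (Path(project_root) / ".llm_config.json").is_file():
        missing.append(".llm_config.json")
    # Assumption: Node.js (needed for the WebUI) is on PATH as "node".
    if which("node") is None:
        missing.append("node")
    return missing


if __name__ == "__main__":
    problems = missing_prerequisites(".")
    print("Missing:", ", ".join(problems) if problems else "nothing")
```

The which parameter exists only so the lookup can be replaced in tests; in normal use the default shutil.which is sufficient.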

Highlighted Details

  • End-to-end pipeline for automated generation of structured educational question banks and benchmarks from PDF textbooks.
  • Modular, operator-based architecture inspired by DataFlow, supporting command-line interaction and a Vue.js WebUI.
  • Integrates MinerU for multimodal document parsing and LLM-as-a-Judge for quality assessment and refinement.
  • Leverages extensive experience from building large-scale domain-specific corpora (e.g., ROBOTheory-79k, CyberSecCorpus).

Maintenance & Community

The README does not provide specific details on maintainers, community channels (like Discord/Slack), sponsorships, or a public roadmap.

Licensing & Compatibility

The README does not explicitly state the project's license.

Limitations & Caveats

The system is semi-automatic and requires human monitoring and intervention. Development is ongoing, as indicated by "TODO" items in the README; planned improvements include direct PDF parsing (instead of image conversion) and WebUI enhancements such as drag-and-drop controls and real-time progress previews.

Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 10
  • Star History: 100 stars in the last 30 days
