DataFlow-EDU by Heartune

Generate exercise sets from textbooks

Created 2 months ago

304 stars

Top 87.7% on SourcePulse

Project Summary

DataFlow-EDU addresses the challenge of automatically generating high-quality, structured educational question banks and benchmarks from PDF textbooks. It targets educators and researchers by providing an end-to-end, operator-based pipeline that transforms raw teaching materials into usable datasets for training and evaluating large language models. The primary benefit is the automation of a labor-intensive process, enabling scalable creation of domain-specific educational content.

How It Works

The project uses an operator-and-pipeline architecture, inspired by DataFlow and PyTorch, for semi-automatic educational dataset generation. It ingests PDF documents and uses MinerU for multimodal document parsing and OCR. Subsequent operators handle content slicing, question generation, dynamic question-type balancing, and multi-stage cleaning (ambiguity and domain refinement). An LLM-as-a-Judge step reviews quality along multiple dimensions, with the aim of keeping hallucination low and the question-type distribution balanced. The pipeline is modular and interactive on the command line, so users can monitor it and intervene, and a Vue.js-based WebUI is provided for management.
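The operator-and-pipeline idea described above can be illustrated with a minimal sketch. This is not the project's code: the names Pipeline, slice_content, generate_questions, and drop_ambiguous are hypothetical, and the question generator is a stand-in for a real LLM call. It only shows how independent operators can be chained so that each stage consumes the previous stage's records.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A record is a plain dict; an operator maps a list of records to a new list.
Record = Dict[str, str]
Operator = Callable[[List[Record]], List[Record]]


@dataclass
class Pipeline:
    """Hypothetical container that runs operators in order."""
    operators: List[Operator]

    def run(self, records: List[Record]) -> List[Record]:
        for op in self.operators:
            records = op(records)
        return records


def slice_content(records: List[Record]) -> List[Record]:
    """Hypothetical slicer: split each parsed page into paragraph chunks."""
    return [
        {"chunk": part.strip()}
        for rec in records
        for part in rec["text"].split("\n\n")
        if part.strip()
    ]


def generate_questions(records: List[Record]) -> List[Record]:
    """Stand-in for the LLM call: alternate question types by chunk index."""
    types = ["multiple_choice", "short_answer"]
    return [
        {**rec, "type": types[i % len(types)], "question": f"Q about: {rec['chunk']}"}
        for i, rec in enumerate(records)
    ]


def drop_ambiguous(records: List[Record]) -> List[Record]:
    """Toy cleaning rule: discard questions built from very short chunks."""
    return [rec for rec in records if len(rec["chunk"].split()) >= 3]


pipeline = Pipeline([slice_content, generate_questions, drop_ambiguous])
pages = [{"text": "Newton's first law states inertia.\n\nToo short.\n\nForce equals mass times acceleration."}]
questions = pipeline.run(pages)
```

Because every operator shares the same list-in, list-out signature, a stage can be swapped, reordered, or paused for manual inspection without touching the others, which fits the semi-automatic workflow the summary describes.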

Quick Start & Requirements

Primary install / run command: Install local DataFlow package via pip install -e . in the DataFlow directory. Run pipeline via python -m dataflow_edu.edu_data_pipeline in the project root.
Non-default prerequisites: Local DataFlow package, LLM configuration (.llm_config.json), Node.js for WebUI.
Estimated setup time or resource footprint: Not specified.
Links: GitHub Repo: https://github.com/Heartune/DataFlow-EDU. Demo slides are located in slide-deck/dataflow-edu/. WebUI default access: http://127.0.0.1:5173 (local).
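Before running the pipeline, the prerequisites listed above can be checked with a short script. The sketch below is an illustration only: the preflight function is hypothetical, it assumes .llm_config.json sits in the project root (the summary does not say where it lives), and it checks only that the file parses as JSON because the expected keys are not documented here.

```python
import json
import shutil
from pathlib import Path
from typing import List


def preflight(project_root: str) -> List[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems: List[str] = []
    # Assumption: the LLM config lives in the project root.
    config = Path(project_root) / ".llm_config.json"
    if not config.is_file():
        problems.append(f"missing {config.name}")
    else:
        try:
            json.loads(config.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            problems.append(f"{config.name} is not valid JSON: {exc.msg}")
    # Node.js is listed as a prerequisite for the WebUI, so report it separately.
    if shutil.which("node") is None:
        problems.append("node not found on PATH (needed for the WebUI)")
    return problems
```

Calling preflight(".") from the project root returns an empty list when both checks pass; otherwise each entry names one missing prerequisite.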

Highlighted Details

End-to-end pipeline for automated generation of structured educational question banks and benchmarks from PDF textbooks.
Modular, operator-based architecture inspired by DataFlow, supporting command-line interaction and a Vue.js WebUI.
Integrates MinerU for multimodal document parsing and LLM-as-a-Judge for quality assessment and refinement.
Draws on prior experience building large-scale domain-specific corpora (e.g., ROBOTheory-79k, CyberSecCorpus).
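The multi-dimensional LLM-as-a-Judge review mentioned above can be pictured as a filter that scores each question on several axes and keeps only those that pass on all of them. The sketch below is an assumption-laden illustration, not the project's implementation: the dimension names, the 0.6 threshold, and the review and toy_judge functions are all hypothetical, and toy_judge replaces the real model call with a trivial rule.

```python
from typing import Callable, Dict, List

Scores = Dict[str, float]
Judge = Callable[[str], Scores]

# Hypothetical review dimensions and pass mark; the source does not list them.
DIMENSIONS = ("clarity", "correctness", "relevance")
THRESHOLD = 0.6


def review(questions: List[str], judge: Judge) -> List[str]:
    """Keep a question only if every dimension meets the threshold."""
    kept = []
    for q in questions:
        scores = judge(q)
        # A missing dimension counts as 0.0, so incomplete reviews fail.
        if all(scores.get(dim, 0.0) >= THRESHOLD for dim in DIMENSIONS):
            kept.append(q)
    return kept


def toy_judge(question: str) -> Scores:
    """Stand-in for an LLM call: penalise text that does not end in '?'."""
    clarity = 0.9 if question.endswith("?") else 0.3
    return {"clarity": clarity, "correctness": 0.8, "relevance": 0.8}


accepted = review(["What is inertia?", "Explain stuff"], toy_judge)
```

Requiring every dimension to pass, rather than averaging the scores, means a question that is clear but factually weak is still rejected; an averaging rule would be a reasonable alternative when recall matters more than precision.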

Maintenance & Community

The README does not provide specific details on maintainers, community channels (like Discord/Slack), sponsorships, or a public roadmap.

Licensing & Compatibility

The README does not explicitly state the project's license.

Limitations & Caveats

The system is semi-automatic and requires human monitoring and intervention. Development is ongoing, as indicated by "TODO" items in the README; planned improvements include direct PDF parsing (instead of image conversion) and enhanced WebUI features such as drag-and-drop controls and real-time progress previews.
